Commit graph

412049 commits

Author SHA1 Message Date
Michal Simek
4d8981f6b7 ARM: 7871/1: amba: Extend number of IRQS
Xilinx Zynq pl330 dma driver has 9 irqs which all have to
be used by the driver to get it work properly.

Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2013-11-09 00:00:09 +00:00
Stephen Boyd
c682e51dbc ARM: 7887/1: Don't smp_cross_call() on UP devices in arch_irq_work_raise()
If we're running a kernel compiled with SMP_ON_UP=y and the
hardware only supports UP operation there isn't any
smp_cross_call function assigned. Unfortunately, we call
smp_cross_call() unconditionally in arch_irq_work_raise() and
crash the kernel on UP devices. Check to make sure we're running
on an SMP device before calling smp_cross_call() here.

Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = c0004000
[00000000] *pgd=00000000
Internal error: Oops: 80000005 [#1] SMP ARM
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.12.0-rc6-00018-g8d45144-dirty #16
task: de05b440 ti: de05c000 task.ti: de05c000
PC is at 0x0
LR is at arch_irq_work_raise+0x3c/0x48
pc : [<00000000>]    lr : [<c0019590>]    psr: 60000193
sp : de05dd60  ip : 00000001  fp : 00000000
r10: c085e2f0  r9 : de05c000  r8 : c07be0a4
r7 : de05c000  r6 : de05c000  r5 : c07c5778  r4 : c0824554
r3 : 00000000  r2 : 00000000  r1 : 00000006  r0 : c0529a58
Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM Segment kernel
Control: 10c5387d  Table: 80004019  DAC: 00000017
Process swapper/0 (pid: 1, stack limit = 0xde05c248)
Stack: (0xde05dd60 to 0xde05e000)
dd60: c07b9dbc c00cb2dc 00000001 c08242c0 c08242c0 60000113 c07be0a8 c00b0590
dd80: de05c000 c085e2f0 c08242c0 c08242c0 c1414c28 c00b07cc de05b440 c1414c28
dda0: c08242c0 c00b0af8 c0862bb0 c0862db0 c1414cd8 de05c028 c0824840 de05ddb8
ddc0: 00000000 00000009 00000001 00000024 c07be0a8 c07be0a4 de05c000 c085e2f0
dde0: 00000000 c004a4b0 00000010 de00d2dc 00000054 00000100 00000024 00000000
de00: de05c028 0000000a ffff8ae7 00200040 00000016 de05c000 60000193 de05c000
de20: 00000054 00000000 00000000 00000000 00000000 c004a704 00000000 de05c008
de40: c07ba254 c004aa1c c07c5778 c0014b70 fa200000 00000054 de05de80 c0861244
de60: 00000000 c0008634 de05b440 c051c778 20000113 ffffffff de05deb4 c051d0a4
de80: 00000001 00000001 00000000 de05b440 c082afac de057ac0 de057ac0 de0443c0
dea0: 00000000 00000000 00000000 00000000 c082afbc de05dec8 c009f2a0 c051c778
dec0: 20000113 ffffffff 00000000 c016edb0 00000000 000002b0 de057ac0 de057ac0
dee0: 00000000 c016ee40 c0875e50 de05df2e de057ac0 00000000 00000013 00000000
df00: 00000000 c016f054 de043600 de0443c0 c008eb38 de004ec0 c0875e50 c008eb44
df20: 00000012 00000000 00000000 3931f0f8 00000000 00000000 00000014 c0822e84
df40: 00000000 c008ed2c 00000000 00000000 00000000 c07b7490 c07b7490 c075ab3c
df60: 00000000 c00701ac 00000002 00000000 c0070160 dffadb73 7bf8edb4 00000000
df80: c051092c 00000000 00000000 00000000 00000000 00000000 00000000 c0510934
dfa0: de05aa40 00000000 c051092c c0013ce8 00000000 00000000 00000000 00000000
dfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
dfe0: 00000000 00000000 00000000 00000000 00000013 00000000 07efffe5 4dfac6f5
[<c0019590>] (arch_irq_work_raise+0x3c/0x48) from [<c00cb2dc>] (irq_work_queue+0xe4/0xf8)
[<c00cb2dc>] (irq_work_queue+0xe4/0xf8) from [<c00b0590>] (rcu_accelerate_cbs+0x1d4/0x1d8)
[<c00b0590>] (rcu_accelerate_cbs+0x1d4/0x1d8) from [<c00b07cc>] (rcu_start_gp+0x34/0x48)
[<c00b07cc>] (rcu_start_gp+0x34/0x48) from [<c00b0af8>] (rcu_process_callbacks+0x318/0x608)
[<c00b0af8>] (rcu_process_callbacks+0x318/0x608) from [<c004a4b0>] (__do_softirq+0x114/0x2a0)
[<c004a4b0>] (__do_softirq+0x114/0x2a0) from [<c004a704>] (do_softirq+0x6c/0x74)
[<c004a704>] (do_softirq+0x6c/0x74) from [<c004aa1c>] (irq_exit+0xac/0x100)
[<c004aa1c>] (irq_exit+0xac/0x100) from [<c0014b70>] (handle_IRQ+0x54/0xb4)
[<c0014b70>] (handle_IRQ+0x54/0xb4) from [<c0008634>] (omap3_intc_handle_irq+0x60/0x74)
[<c0008634>] (omap3_intc_handle_irq+0x60/0x74) from [<c051d0a4>] (__irq_svc+0x44/0x5c)
Exception stack(0xde05de80 to 0xde05dec8)
de80: 00000001 00000001 00000000 de05b440 c082afac de057ac0 de057ac0 de0443c0
dea0: 00000000 00000000 00000000 00000000 c082afbc de05dec8 c009f2a0 c051c778
dec0: 20000113 ffffffff
[<c051d0a4>] (__irq_svc+0x44/0x5c) from [<c051c778>] (_raw_spin_unlock_irq+0x28/0x2c)
[<c051c778>] (_raw_spin_unlock_irq+0x28/0x2c) from [<c016edb0>] (proc_alloc_inum+0x30/0xa8)
[<c016edb0>] (proc_alloc_inum+0x30/0xa8) from [<c016ee40>] (proc_register+0x18/0x130)
[<c016ee40>] (proc_register+0x18/0x130) from [<c016f054>] (proc_mkdir_data+0x44/0x6c)
[<c016f054>] (proc_mkdir_data+0x44/0x6c) from [<c008eb44>] (register_irq_proc+0x6c/0x128)
[<c008eb44>] (register_irq_proc+0x6c/0x128) from [<c008ed2c>] (init_irq_proc+0x74/0xb0)
[<c008ed2c>] (init_irq_proc+0x74/0xb0) from [<c075ab3c>] (kernel_init_freeable+0x84/0x1c8)
[<c075ab3c>] (kernel_init_freeable+0x84/0x1c8) from [<c0510934>] (kernel_init+0x8/0x150)
[<c0510934>] (kernel_init+0x8/0x150) from [<c0013ce8>] (ret_from_fork+0x14/0x2c)
Code: bad PC value

Fixes: bf18525fd7 "ARM: 7872/1: Support arch_irq_work_raise() via self IPIs"

Reported-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Tested-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2013-11-09 00:00:07 +00:00
Bart Van Assche
cd4e38542a IB/srp: Report receive errors correctly
The IB spec does not guarantee that the opcode is available in error
completions.  Hence do not rely on it.  See also commit 948d1e889e
("IB/srp: Introduce srp_handle_qp_err()").

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: <stable@vger.kernel.org> # v3.8
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:18 -08:00
Bart Van Assche
99b6697a50 IB/srp: Avoid offlining operational SCSI devices
If SCSI commands are submitted with a SCSI request timeout that is
lower than the the IB RC timeout, it can happen that the SCSI error
handler has already started device recovery before transport layer
error handling starts.  So it can happen that the SCSI error handler
tries to abort a SCSI command after it has been reset by
srp_rport_reconnect().

Tell the SCSI error handler that such commands have finished and that
it is not necessary to continue its recovery strategy for commands
that have been reset by srp_rport_reconnect().

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:18 -08:00
Vu Pham
65d7dd2f34 IB/srp: Remove target from list before freeing Scsi_Host structure
Remove an SRP target from the SRP target list before invoking the last
scsi_host_put() call.  This change is necessary because that last put
frees the memory that holds the srp_target_port structure.

This patch prevents the following kernel oops:

    RIP: 0010:[<ffffffff810b00d0>] __lock_acquire+0x500/0x1570
    Call Trace:
     [<ffffffff810b11e4>] lock_acquire+0xa4/0x120
     [<ffffffff81531206>] _spin_lock+0x36/0x70
     [<ffffffffa01b6d8f>] srp_remove_work+0xef/0x180 [ib_srp]
     [<ffffffff8109125c>] worker_thread+0x21c/0x3d0
     [<ffffffff81096e86>] kthread+0x96/0xa0
     [<ffffffff8100c20a>] child_rip+0xa/0x20

Signed-off-by: Vu Pham <vuhuong@mellanox.com>

[ bvanassche - Modified path description and CC'ed stable. ]

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:17 -08:00
Jack Wang
71444b9781 IB/srp: Add change_queue_depth and change_queue_type support
Currently, it's not possible to change queue depth for a device behind
SRP host. Sometimes, we need to adjust queue_depth for performance
reason (eg storage busy, we need lower queue_depth to avoid running
into SCSI error handler), so this patch add support for SRP driver.

Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
Tested-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:17 -08:00
Bart Van Assche
4d73f95f70 IB/srp: Make queue size configurable
Certain storage configurations, e.g. a sufficiently large array of
hard disks in a RAID configuration, need a queue depth above 64 to
achieve optimal performance. Hence make the queue depth configurable.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Tested-by: Jack Wang <xjtuwjp@gmail.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:17 -08:00
Bart Van Assche
b81d00bddf IB/srp: Introduce srp_alloc_req_data()
This patch does not change any functionality.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Cc: Roland Dreier <roland@purestorage.com>
Cc: Vu Pham <vu@mellanox.com>
Cc: Sebastian Riemer <sebastian.riemer@profitbricks.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:17 -08:00
Bart Van Assche
848b3082db IB/srp: Export sgid to sysfs
On an initiator system with multiple IB ports it is not yet possible
to figure out what the originating port of an SRP connection is. Hence
make the source GID available in sysfs.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:16 -08:00
Bart Van Assche
a95cadb9da IB/srp: Add periodic reconnect functionality
After a transport layer occurred, periodically try to reconnect
to the target until the dev_loss timer expires.  Protect the
callback functions that can be invoked from inside the SCSI EH
against concurrent invocation with srp_reconnect_rport() via the
rport mutex. Change the default dev_loss_tmo from 60s into 600s
to give the reconnect mechanism a chance to kick in.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:16 -08:00
Bart Van Assche
8c64e4531c scsi_transport_srp: Add periodic reconnect support
Add support for periodically reconnecting to an SRP target until
the dev_loss timer expires. After the tenth reconnection attempt,
gradually slow down subsequent reconnect attempts.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:16 -08:00
Bart Van Assche
c1120f8981 IB/srp: Start timers if a transport layer error occurs
Start the reconnect timer, fast_io_fail timer and dev_loss timers if a
transport layer error occurs.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:15 -08:00
Bart Van Assche
ed9b2264fb IB/srp: Use SRP transport layer error recovery
Enable fast_io_fail_tmo and dev_loss_tmo functionality for the IB SRP
initiator.  Add kernel module parameters that allow to specify default
values for these parameters.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:15 -08:00
Bart Van Assche
29c1732480 scsi_transport_srp: Add transport layer error handling
Add the necessary functions in the SRP transport module to allow an
SRP initiator driver to implement transport layer error handling
similar to the functionality already provided by the FC transport
layer. This includes:

- Support for implementing fast_io_fail_tmo, the time that should
  elapse after having detected a transport layer problem and
  before failing I/O.
- Support for implementing dev_loss_tmo, the time that should
  elapse after having detected a transport layer problem and
  before removing a remote port.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:15 -08:00
Bart Van Assche
9dd69a600a IB/srp: Keep rport as long as the IB transport layer
Keep the rport data structure around after srp_remove_host() has
finished until cleanup of the IB transport layer has finished
completely. This is necessary because later patches use the rport
pointer inside the queuecommand callback. Without this patch
accessing the rport from inside a queuecommand callback is racy
because srp_remove_host() must be invoked before scsi_remove_host()
and because the queuecommand callback could get invoked after
srp_remove_host() has finished. In other words, without this patch
the queuecommand callback can get invoked after the rport data
structure has been freed.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:15 -08:00
Vu Pham
7bb312e4a2 IB/srp: Make transport layer retry count configurable
Allow the InfiniBand RC retry count to be configured by the user as an
option in the target login string.  Reducing this retry count allows to
reduce the path failover time.

Signed-off-by: Vu Pham <vu@mellanox.com>

[ bvanassche: Rewrote patch description / changed default retry count ]

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: David Dillow <dillowda@ornl.gov>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:15 -08:00
Mike Marciniszyn
2fadd83184 IB/qib: Fix txselect regression
Commit 7fac33014f54("IB/qib: checkpatch fixes") was overzealous in
removing a simple_strtoul for a parse routine, setup_txselect().  That
routine is required to handle a multi-value string.

Unwind that aspect of the fix.

Cc: <stable@vger.kernel.org>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:12 -08:00
Mike Marciniszyn
78a5886472 IB/qib: Fix checkpatch __packed warnings
Convert __attribute__ ((packed)) to __packed.

Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:12 -08:00
Jan Kara
603e772992 IB/qib: Convert qib_user_sdma_pin_pages() to use get_user_pages_fast()
qib_user_sdma_queue_pkts() gets called with mmap_sem held for
writing. Except for get_user_pages() deep down in
qib_user_sdma_pin_pages() we don't seem to need mmap_sem at all.  Even
more interestingly the function qib_user_sdma_queue_pkts() (and also
qib_user_sdma_coalesce() called somewhat later) call copy_from_user()
which can hit a page fault and we deadlock on trying to get mmap_sem
when handling that fault.

So just make qib_user_sdma_pin_pages() use get_user_pages_fast() and
leave mmap_sem locking for mm.

This deadlock has actually been observed in the wild when the node
is under memory pressure.

Cc: <stable@vger.kernel.org>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:11 -08:00
Jan Kara
4adcf7fb67 IB/ipath: Convert ipath_user_sdma_pin_pages() to use get_user_pages_fast()
ipath_user_sdma_queue_pkts() gets called with mmap_sem held for
writing.  Except for get_user_pages() deep down in
ipath_user_sdma_pin_pages() we don't seem to need mmap_sem at all.

Even more interestingly the function ipath_user_sdma_queue_pkts() (and
also ipath_user_sdma_coalesce() called somewhat later) call
copy_from_user() which can hit a page fault and we deadlock on trying
to get mmap_sem when handling that fault.  So just make
ipath_user_sdma_pin_pages() use get_user_pages_fast() and leave
mmap_sem locking for mm.

This deadlock has actually been observed in the wild when the node
is under memory pressure.

Cc: <stable@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>

[ Merged in fix for call to get_user_pages_fast from Tetsuo Handa
  <penguin-kernel@I-love.SAKURA.ne.jp>.  - Roland ]

Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:11 -08:00
Naresh Gottumukkala
d5e3f37833 RDMA/ocrdma: Remove redundant check in ocrdma_build_fr()
Remove the redundant check of comparing if a 32-bit value is greater
than 0xffffffffULL.

Reported by Dan Carpenter.

Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:06 -08:00
Naresh Gottumukkala
1852d1da3b RDMA/ocrdma: Fix a crash in rmmod
1) ocrdma_remove_free() is called from a call_rcu callback funtion
   context, which can be a bottom-half context. So the code in
   ocrdma_remove_free should not sleep.

   But ocrdma_cleanup_hw() can sleep, So move it ocrdma_remove()
   instead of ocrdma_remove_free.

2) Fix a couple of kbuild test robot warnings.

Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:06 -08:00
Dan Carpenter
6ebacdfc07 RDMA/ocrdma: Silence an integer underflow warning
We recently added a cap on "max_wqe_allocated" in 43a6b4025c
('RDMA/ocrdma: Create IRD queue fix').

My static checker complains that the cap has a problem because it
casts large values to negative.  "attrs->cap.max_send_wr" is a u32.
It comes from the user, but it's capped in ocrdma_check_qp_params() so
it can't wrap here.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:05 -08:00
Eli Cohen
1b77d2bd75 mlx5: Use enum to indicate adapter page size
The Connect-IB adapter has an inherent page size which equals 4K.
Define an new enum that equals the page shift and use it instead of
using the value 12 throughout the code.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:01 -08:00
Eli Cohen
c2a3431e61 IB/mlx5: Update opt param mask for RTS2RTS
RTS to RTS transition should allow update of alternate path.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:01 -08:00
Eli Cohen
07c9113fe8 IB/mlx5: Remove "Always false" comparison
mlx5_cur and mlx5_new cannot have negative values so remove the
redundant condition.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:01 -08:00
Eli Cohen
2d036fad94 IB/mlx5: Remove dead code in mr.c
In mlx5_mr_cache_init() the size variable is not used so remove it to
avoid compiler warnings when running with make W=1.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:00 -08:00
Moshe Lazer
4e3d677ba9 mlx5_core: Change optimal_reclaimed_pages for better performance
Change optimal_reclaimed_pages() to increase the output size of each
reclaim pages command. This change reduces significantly the amount of
reclaim pages commands issued to FW when the driver is unloaded which
reduces the overall driver unload time.

Signed-off-by: Moshe Lazer <moshel@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:00 -08:00
Eli Cohen
87b8de492d mlx5: Clear reserved area in set_hca_cap()
Firmware spec requires reserved fields to be cleared when calling
set_hca_cap.  Current code queries and copy to the set area, possibly
resulting in reserved bits not cleared. This patch copies only
writable fields to the set area.

Fix also typo - msx => max

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:00 -08:00
Eli Cohen
bf0bf77f65 mlx5: Support communicating arbitrary host page size to firmware
Connect-IB firmware requires 4K pages to be communicated with the
driver. This patch breaks larger pages to 4K units to enable support
for architectures utilizing larger page size, such as PowerPC.  This
patch also fixes several places that referred to PAGE_SHIFT instead of
explicit 12 which is the inherent page shift on Connect-IB.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:00 -08:00
Eli Cohen
952f5f6e80 mlx5: Fix cleanup flow when DMA mapping fails
If DMA mapping fails, the driver cleared the object that holds the
previously DMA mapped pages. Fix this by allocating a new object for
the command that reports back to firmware that pages can't be
supplied.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:43:00 -08:00
Moshe Lazer
cfd8f1d49b IB/mlx5: Fix srq free in destroy qp
On destroy QP the driver walks over the relevant CQ and removes CQEs
reported for the destroyed QP.  It also frees the related SRQ entry
without checking that this is actually an SRQ-related CQE.  In case of
a CQ used for both send and receive QP, we could free SRQ entries for
send CQEs.  This patch resolves this issue by verifying that this is a
SRQ related CQE by checking the SRQ number in the CQE is not zero.

Signed-off-by: Moshe Lazer <moshel@mellanox.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:59 -08:00
Eli Cohen
1faacf82df IB/mlx5: Simplify mlx5_ib_destroy_srq
Make use of destroy_srq_kernel() to clear SRQ resouces.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:59 -08:00
Eli Cohen
9641b74ebe IB/mlx5: Fix overflow check in IB_WR_FAST_REG_MR
Make sure not to overflow when reading the page list from struct
ib_fast_reg_page_list.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:59 -08:00
Eli Cohen
746b5583c1 IB/mlx5: Multithreaded create MR
Use asynchronous commands to execute up to eight concurrent create MR
commands. This is to fill memory caches faster so we keep consuming
from there.  Also, increase timeout for shrinking caches to five
minutes.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:59 -08:00
Eli Cohen
51ee86a4af IB/mlx5: Fix check of number of entries in create CQ
Verify that the value is non negative before rounding up to power of 2.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:58 -08:00
Mathias Krause
5476781bb9 IB/netlink: Remove superfluous RDMA_NL_GET_OP() masking
'op' is the already RDMA_NL_GET_OP() masked 'type'.  No need to mask it again.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Reviewed-by: Yann Droneaud <ydroneaud@opteya.com>
Acked-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:54 -08:00
Latchesar Ionkov
6b7d103c1b IB/core: Pass imm_data from ib_uverbs_send_wr to ib_send_wr correctly
Currently, we don't copy the immediate data from the userspace struct
to the kernel one when UD messages are being sent.

This patch makes sure that the immediate data is set correctly.

Signed-off-by: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:54 -08:00
Michal Schmidt
7f1a38671c IPoIB: lower NAPI weight
Since commit 82dc3c63c6 ("net: introduce NAPI_POLL_WEIGHT")
netif_napi_add() produces an error message if a NAPI poll weight
greater than 64 is requested.

Use the standard NAPI weight.

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:50 -08:00
Erez Shitrit
94232d9ce8 IPoIB: Start multicast join process only on active ports
The driver starts the mcast_join task whenever the netdev interface is
UP without relation to the underlying IB port state.

Until the port state is ACTIVE all the join requests are irrelevant,
and the IB core returns -EINVAL. So the user will see errors such as:
"multicast join failed for ff12:401b:... , status -22".

Instead, have ipoib_mcast_join_task() return when the port is not active.

It will be called again when the port state is changed and the
low-level driver triggers the IB_EVENT_PORT_ACTIVE event or the
IB_EVENT_CLIENT_REREGISTER event.

Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:49 -08:00
Erez Shitrit
a39c52ab88 IPoIB: Add path query flushing in ipoib_ib_dev_cleanup
The path_rec_completion() callback may be invoked asynchronously even
at the middle of "driver uninit" process.  This can lead to scheduling
a task that tries to touch members of the priv object that are no
longer valid.  For example the function cm_create_tx_qp can attempt to
create qp with no valid priv->pd object.

The following crash is one of the results:
RIP: 0010:[<ffffffffa021bb47>]  [<ffffffffa021bb47>] ipoib_cm_create_tx_qp+0x57/0x90 [ib_ipoib]
Process ipoib (pid: 5916, threadinfo ffff8803786e4000, task ffff8804150e1500)
Stack:
Call Trace:
[<ffffffff81309ef0>] ? get_random_bytes+0x20/0x30
[<ffffffffa021be2a>] ipoib_cm_tx_init+0xca/0x340 [ib_ipoib]
[<ffffffffa021f765>] ipoib_cm_tx_start+0x215/0x3f0 [ib_ipoib]
[<ffffffffa021f550>] ? ipoib_cm_tx_start+0x0/0x3f0 [ib_ipoib]
[<ffffffff8108b2b0>] worker_thread+0x170/0x2a0
[<ffffffff81090bf0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8108b140>] ? worker_thread+0x0/0x2a0
[<ffffffff81090886>] kthread+0x96/0xa0
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffff810907f0>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20

Fix that by flushing all pending path queries at this point.

Signed-off-by: Alex Markuze <markuze@mellanox.com>
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:49 -08:00
Erez Shitrit
a9c8ba5884 IPoIB: Fix usage of uninitialized multicast objects
The driver should avoid calling ib_sa_free_multicast on the mcast->mc
object until it finishes its initialization state.  Otherwise we can
crash when ipoib_mcast_dev_flush() attempts to use the uninitialized
multicast object.

Instead, only call wait_for_completion() for multicast entries that
started the join process, meaning that ib_sa_join_multicast() finished.

Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:49 -08:00
Erez Shitrit
aede25011f IPoIB: Avoid flushing the driver workqueue on dev_down
The driver should not flush the whole workqueue when only one work (the
pkey poll one) needs to be cancelled.  Use cancel_delayed_work_sync()
instead.

Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:49 -08:00
Erez Shitrit
f47944cc2d IPoIB: Fix deadlock between dev_change_flags() and __ipoib_dev_flush()
When ipoib interface is going down it takes all of its children with
it, under mutex.

For each child, dev_change_flags() is called.  That function calls
ipoib_stop() via the ndo, and causes flush of the workqueue.
Sometimes in the workqueue an __ipoib_dev_flush work() is waiting and
when invoked tries to get the same mutex, which leads to a deadlock,
as seen below.

The solution is to switch to rw-sem instead of mutex.

The deadlock:
[11028.165303]  [<ffffffff812b0977>] ? vgacon_scroll+0x107/0x2e0
[11028.171844]  [<ffffffff814eaac5>] schedule_timeout+0x215/0x2e0
[11028.178465]  [<ffffffff8105a5c3>] ? perf_event_task_sched_out+0x33/0x80
[11028.185962]  [<ffffffff814ea743>] wait_for_common+0x123/0x180
[11028.192491]  [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
[11028.199504]  [<ffffffff814ea85d>] wait_for_completion+0x1d/0x20
[11028.206224]  [<ffffffff8108b4f1>] flush_cpu_workqueue+0x61/0x90
[11028.212948]  [<ffffffff8108b5a0>] ? wq_barrier_func+0x0/0x20
[11028.219375]  [<ffffffff8108bfc4>] flush_workqueue+0x54/0x80
[11028.225712]  [<ffffffffa05a0576>] ipoib_mcast_stop_thread+0x66/0x90 [ib_ipoib]
[11028.233988]  [<ffffffffa059ccea>] ipoib_ib_dev_down+0x6a/0x100 [ib_ipoib]
[11028.241678]  [<ffffffffa059849a>] ipoib_stop+0x8a/0x140 [ib_ipoib]
[11028.248692]  [<ffffffff8142adf1>] dev_close+0x71/0xc0
[11028.254447]  [<ffffffff8142a631>] dev_change_flags+0xa1/0x1d0
[11028.261062]  [<ffffffffa059851b>] ipoib_stop+0x10b/0x140 [ib_ipoib]
[11028.268172]  [<ffffffff8142adf1>] dev_close+0x71/0xc0
[11028.273922]  [<ffffffff8142a631>] dev_change_flags+0xa1/0x1d0
[11028.280452]  [<ffffffff8148f20b>] devinet_ioctl+0x5eb/0x6a0
[11028.286786]  [<ffffffff814903b8>] inet_ioctl+0x88/0xa0
[11028.292633]  [<ffffffff8141591a>] sock_ioctl+0x7a/0x280
[11028.298576]  [<ffffffff81189012>] vfs_ioctl+0x22/0xa0
[11028.304326]  [<ffffffff81140540>] ? unmap_region+0x110/0x130
[11028.310756]  [<ffffffff811891b4>] do_vfs_ioctl+0x84/0x580
[11028.316897]  [<ffffffff81189731>] sys_ioctl+0x81/0xa0

and

11028.017533]  [<ffffffff8105a5c3>] ? perf_event_task_sched_out+0x33/0x80
[11028.025030]  [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
[11028.031945]  [<ffffffff814eb2ae>] __mutex_lock_slowpath+0x13e/0x180
[11028.039053]  [<ffffffff814eb14b>] mutex_lock+0x2b/0x50
[11028.044910]  [<ffffffffa059f7e7>] __ipoib_ib_dev_flush+0x37/0x210 [ib_ipoib]
[11028.052894]  [<ffffffffa059fa00>] ? ipoib_ib_dev_flush_light+0x0/0x20 [ib_ipoib]
[11028.061363]  [<ffffffffa059fa17>] ipoib_ib_dev_flush_light+0x17/0x20 [ib_ipoib]
[11028.069738]  [<ffffffff8108b120>] worker_thread+0x170/0x2a0
[11028.076068]  [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
[11028.083374]  [<ffffffff8108afb0>] ? worker_thread+0x0/0x2a0
[11028.089709]  [<ffffffff81090626>] kthread+0x96/0xa0
[11028.095266]  [<ffffffff8100c0ca>] child_rip+0xa/0x20
[11028.100921]  [<ffffffff81090590>] ? kthread+0x0/0xa0
[11028.106573]  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
[11028.112423] INFO: task ifconfig:23640 blocked for more than 120 seconds.

Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:49 -08:00
Tal Alon
22252b4e09 IPoIB: Change CM skb memory allocation to be non-atomic during init
Change CM skb memory allocation to use GFP_KERNEL when possible.

During device init there's no need to use GFP_ATOMIC when allocating
memory for the CM skbs -- use GFP_KERNEL instead.

Signed-off-by: Tal Alon <talal@mellanox.com>
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:48 -08:00
Erez Shitrit
c2bb5628db IPoIB: Fix crash in dev_open error flow
If napi has never been enabled when calling ipoib_ib_dev_stop, a
kernel crash occurs, because the verbs layer completion handler
(ipoib_ib_completion) calls napi_schedule unconditionally.

If the napi structure passed in the napi_schedule call has not
been initialized, napi will crash.

The cleanest solution is to simply enable napi before calling
ipoib_ib_dev_stop in the dev_open error flow. (dev_stop then
immediately disables napi).

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:48 -08:00
Ben Hutchings
649fb5ec0e IB/cxgb4: Fix formatting of physical address
Physical addresses may be wider than virtual addresses (e.g. on i386
with PAE) and must not be formatted with %p.

Compile-tested only.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:30 -08:00
Doug Ledford
be9130cc92 IB/cma: Check for GID on listening device first
As a simple optimization that should speed up the vast majority of
connect attemps on IB devices, when we are searching for the GID of an
incoming connection in the cached GID lists of devices, search the
device that received the incoming connection request first.  If we
don't find it there, then move on to other devices.

This reduces the time to perform 10,000 connections considerably.
Prior to this patch, a bad run of cmtime would look like this:

connect      :    12399.26   12351.10    8609.00    1239.93

With this patch, it looks more like this:

connect      :     5864.86    5799.80    8876.00     586.49

Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:24 -08:00
Doug Ledford
29f27e8477 IB/cma: Use cached gids
The cma_acquire_dev function was changed by commit 3c86aa70bf
("RDMA/cm: Add RDMA CM support for IBoE devices") to use find_gid_port()
because multiport devices might have either IB or IBoE formatted gids.
The old function assumed that all ports on the same device used the
same GID format.

However, when it was changed to use find_gid_port(), we inadvertently
lost usage of the GID cache.  This turned out to be a very costly
change.  In our testing, each iteration through each index of the GID
table takes roughly 35us.  When you have multiple devices in a system,
and the GID you are looking for is on one of the later devices, the
code loops through all of the GID indexes on all of the early devices
before it finally succeeds on the target device.  This pathological
search behavior combined with 35us per GID table index retrieval
results in results such as the following from the cmtime application
that's part of the latest librdmacm git repo:

ib1:
step              total ms     max ms     min us  us / conn
create id    :       29.42       0.04       1.00       2.94
bind addr    :   186705.66      19.00   18556.00   18670.57
resolve addr :       41.93       9.68     619.00       4.19
resolve route:      486.93       0.48     101.00      48.69
create qp    :     4021.95       6.18     330.00     402.20
connect      :    68350.39   68588.17   24632.00    6835.04
disconnect   :     1460.43     252.65-1862269.00     146.04
destroy      :       41.16       0.04       2.00       4.12

ib0:
step              total ms     max ms     min us  us / conn
create id    :       28.61       0.68       1.00       2.86
bind addr    :     2178.86       2.95     201.00     217.89
resolve addr :       51.26      16.85     845.00       5.13
resolve route:      620.08       0.43      92.00      62.01
create qp    :     3344.40       6.36     273.00     334.44
connect      :     6435.99    6368.53    7844.00     643.60
disconnect   :     5095.38     321.90     757.00     509.54
destroy      :       37.13       0.02       2.00       3.71

Clearly, both the bind address and connect operations suffer
a huge penalty for being anything other than the default
GID on the first port in the system.

After applying this patch, the numbers now look like this:

ib1:
step              total ms     max ms     min us  us / conn
create id    :       30.15       0.03       1.00       3.01
bind addr    :       80.27       0.04       7.00       8.03
resolve addr :       43.02      13.53     589.00       4.30
resolve route:      482.90       0.45     100.00      48.29
create qp    :     3986.55       5.80     330.00     398.66
connect      :     7141.53    7051.29    5005.00     714.15
disconnect   :     5038.85     193.63     918.00     503.88
destroy      :       37.02       0.04       2.00       3.70

ib0:
step              total ms     max ms     min us  us / conn
create id    :       34.27       0.05       1.00       3.43
bind addr    :       26.45       0.04       1.00       2.64
resolve addr :       38.25      10.54     760.00       3.82
resolve route:      604.79       0.43      97.00      60.48
create qp    :     3314.95       6.34     273.00     331.49
connect      :    12399.26   12351.10    8609.00    1239.93
disconnect   :     5096.76     270.72    1015.00     509.68
destroy      :       37.10       0.03       2.00       3.71

It's worth noting that we still suffer a bit of a penalty on
connect to the wrong device, but the penalty is much less than
it used to be.  Follow on patches deal with this penalty.

Many thanks to Neil Horman for helping to track the source of
slow function that allowed us to track down the fact that
the original patch I mentioned above backed out cache usage
and identify just how much that impacted the system.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08 14:42:24 -08:00
Trond Myklebust
a6b31d18b0 SUNRPC: Fix a data corruption issue when retransmitting RPC calls
The following scenario can cause silent data corruption when doing
NFS writes. It has mainly been observed when doing database writes
using O_DIRECT.

1) The RPC client uses sendpage() to do zero-copy of the page data.
2) Due to networking issues, the reply from the server is delayed,
   and so the RPC client times out.

3) The client issues a second sendpage of the page data as part of
   an RPC call retransmission.

4) The reply to the first transmission arrives from the server
   _before_ the client hardware has emptied the TCP socket send
   buffer.
5) After processing the reply, the RPC state machine rules that
   the call to be done, and triggers the completion callbacks.
6) The application notices the RPC call is done, and reuses the
   pages to store something else (e.g. a new write).

7) The client NIC drains the TCP socket send buffer. Since the
   page data has now changed, it reads a corrupted version of the
   initial RPC call, and puts it on the wire.

This patch fixes the problem in the following manner:

The ordering guarantees of TCP ensure that when the server sends a
reply, then we know that the _first_ transmission has completed. Using
zero-copy in that situation is therefore safe.
If a time out occurs, we then send the retransmission using sendmsg()
(i.e. no zero-copy), We then know that the socket contains a full copy of
the data, and so it will retransmit a faithful reproduction even if the
RPC call completes, and the application reuses the O_DIRECT buffer in
the meantime.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2013-11-08 17:19:15 -05:00