Commit graph

4,040 commits

Author SHA1 Message Date
Naresh Gottumukkala
cffce99051 RDMA/ocrdma: Dont use PD 0 for userpace CQ DB
Create_CQ verb doesn't provide a PD pointer.  So, until now we are
creating all (both userspace and kernel) CQ DB regions from PD0.  This
will result in mmapping PD0 to applications.  A rogue userspace
application can mess things up.

Also more serious issues is even the be2net NIC uses PD0.

This patch addresses this problem by:

1) Create a PD page for every userspace application when the
   alloc_ucontext is called. This will be destroyed in
   dealloc_ucontext.
2) All CQs for that context will use the PD allocated in ucontext.
3) The first create_PD call from application will result in returning
   the PD address from its ucontext (no new PD will be created).
4) For subsecquent create_pd calls from application, we create new PDs for
   the application.

Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-09-02 21:18:32 -07:00
Naresh Gottumukkala
2b51a9b9eb RDMA/ocrdma: FRMA code cleanup
1) Fixed setting FR_MR bit for FRWR stag allocation
2) Access rights are passsed during FRWR stage and not during STAT allocation stage
3) FRWR WQE structure cleanup
4) Add QP level signaled bit.

Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-09-02 21:17:56 -07:00
Naresh Gottumukkala
f11220ee69 RDMA/ocrdma: For ERX2 irrespective of Qid, num_posted offset is 24
1) All RQ doorbells are handled by ERX2 and doorbell->num_posted
   offset is constant to bit offset 24 for ERX2 irrspective of Q id.

2) Fixed RESET to INIT state change (from ERR->RST->INIT->RTR case).

Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-09-02 21:17:55 -07:00
Naresh Gottumukkala
c88bd03ffc RDMA/ocrdma: Fix to work with even a single MSI-X vector
There are cases like SRIOV where can get only one MSI-X vector
allocated for RoCE.  In that case we need to use the vector for both
data plane and control plane.  We need to use EQ create version V2.

Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-09-02 21:17:54 -07:00
Naresh Gottumukkala
d3cb6c0b2a RDMA/ocrdma: Remove the MTU check based on Ethernet MTU
Also increase MAX AH to 512.

Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-09-02 21:17:53 -07:00
Naresh Gottumukkala
7c33880c3c RDMA/ocrdma: Add support for fast register work requests (FRWR)
Also get the max_srq value from query_config mailbox response.

Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-09-02 21:17:48 -07:00
Naresh Gottumukkala
43a6b4025c RDMA/ocrdma: Create IRD queue fix
1)	Fix ocrdma_get_num_posted_shift for upto 128 QPs.
2)	Create for min of dev->max_wqe and requested wqe in create_qp.
3)	As part of creating ird queue, populate with basic header templates.
4)	Make sure all the DB memory allocated to userspace are page aligned.
5)	Fix issue in checking the mmap local cache.
6)	Some code cleanup.

Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-09-02 21:16:21 -07:00
Matan Barak
22878dbc91 IB/core: Better checking of userspace values for receive flow steering
- Don't allow unsupported comp_mask values, user should check
    ibv_query_device to know which features are supported.
  - Add a check in ib_uverbs_create_flow() to verify the size passed
    from the user space.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-09-02 11:12:48 -07:00
Hadar Hen Zion
f77c0162a3 IB/mlx4: Add receive flow steering support
Implement ib_create_flow() and ib_destroy_flow().

Translate the verbs structures provided by the user to HW structures
and call the MLX4_QP_FLOW_STEERING_ATTACH/DETACH firmware commands.

On the ATTACH command completion, the firmware provides a 64-bit
registration ID, which is placed into struct mlx4_ib_flow that wraps
the instance of struct ib_flow which is retuned to caller.  Later,
this reg ID is used for detaching that flow from the firmware.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-28 09:53:56 -07:00
Hadar Hen Zion
436f2ad05a IB/core: Export ib_create/destroy_flow through uverbs
Implement ib_uverbs_create_flow() and ib_uverbs_destroy_flow() to
support flow steering for user space applications.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-28 09:53:14 -07:00
Igor Ivanov
400dbc9658 IB/core: Infrastructure for extensible uverbs commands
Add infrastructure to support extended uverbs capabilities in a
forward/backward manner.  Uverbs command opcodes which are based on
the verbs extensions approach should be greater or equal to
IB_USER_VERBS_CMD_THRESHOLD.  They have new header format and
processed a bit differently.

Whenever a specific IB_USER_VERBS_CMD_XXX is extended, which practically means
it needs to have additional arguments, we will be able to add them without creating
a completely new IB_USER_VERBS_CMD_YYY command or bumping the uverbs ABI version.

This patch for itself doesn't provide the whole scheme which is also dependent
on adding a comp_mask field to each extended uverbs command struct.

The new header framework allows for future extension of the CMD arguments
(ib_uverbs_cmd_hdr.in_words, ib_uverbs_cmd_hdr.out_words) for an existing
new command (that is a command that supports the new uverbs command header format
suggested in this patch) w/o bumping ABI version and with maintaining backward
and formward compatibility to new and old libibverbs versions.

In the uverbs command we are passing both uverbs arguments and the provider arguments.
We split the ib_uverbs_cmd_hdr.in_words to ib_uverbs_cmd_hdr.in_words which will now carry only
uverbs input argument struct size and  ib_uverbs_cmd_hdr.provider_in_words that will carry
the provider input argument size. Same goes for the response (the uverbs CMD output argument).

For example take the create_cq call and the mlx4_ib provider:

The uverbs layer gets libibverb's struct ibv_create_cq (named struct ib_uverbs_create_cq
in the kernel), mlx4_ib gets libmlx4's struct mlx4_create_cq (which includes struct
ibv_create_cq and is named struct mlx4_ib_create_cq in the kernel) and
in_words = sizeof(mlx4_create_cq)/4 .

Thus ib_uverbs_cmd_hdr.in_words carry both uverbs plus mlx4_ib input argument sizes,
where uverbs assumes it knows the size of its input argument - struct ibv_create_cq.

Now, if we wish to add a variable to struct ibv_create_cq, we can add a comp_mask field
to the struct which is basically bit field indicating which fields exists in the struct
(as done for the libibverbs API extension), but we need a way to tell what is the total
size of the struct and not assume the struct size is predefined (since we may get different
struct sizes from different user libibverbs versions). So we know at which point the
provider input argument (struct mlx4_create_cq) begins. Same goes for extending the
provider struct mlx4_create_cq. Thus we split the ib_uverbs_cmd_hdr.in_words to
ib_uverbs_cmd_hdr.in_words which will now carry only uverbs input argument struct size and
ib_uverbs_cmd_hdr.provider_in_words that will carry the provider (mlx4_ib) input argument size.

Signed-off-by: Igor Ivanov <Igor.Ivanov@itseez.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-28 09:52:03 -07:00
Hadar Hen Zion
319a441d13 IB/core: Add receive flow steering support
The RDMA stack allows for applications to create IB_QPT_RAW_PACKET
QPs, which receive plain Ethernet packets, specifically packets that
don't carry any QPN to be matched by the receiving side.  Applications
using these QPs must be provided with a method to program some
steering rule with the HW so packets arriving at the local port can be
routed to them.

This patch adds ib_create_flow(), which allow providing a flow
specification for a QP.  When there's a match between the
specification and a received packet, the packet is forwarded to that
QP, in a the same way one uses ib_attach_multicast() for IB UD
multicast handling.

Flow specifications are provided as instances of struct ib_flow_spec_yyy,
which describe L2, L3 and L4 headers.  Currently specs for Ethernet, IPv4,
TCP and UDP are defined.  Flow specs are made of values and masks.

The input to ib_create_flow() is a struct ib_flow_attr, which contains
a few mandatory control elements and optional flow specs.

    struct ib_flow_attr {
            enum ib_flow_attr_type type;
            u16      size;
            u16      priority;
            u32      flags;
            u8       num_of_specs;
            u8       port;
            /* Following are the optional layers according to user request
             * struct ib_flow_spec_yyy
             * struct ib_flow_spec_zzz
             */
    };

As these specs are eventually coming from user space, they are defined and
used in a way which allows adding new spec types without kernel/user ABI
change, just with a little API enhancement which defines the newly added spec.

The flow spec structures are defined with TLV (Type-Length-Value)
entries, which allows calling ib_create_flow() with a list of variable
length of optional specs.

For the actual processing of ib_flow_attr the driver uses the number
of specs and the size mandatory fields along with the TLV nature of
the specs.

Steering rules processing order is according to the domain over which
the rule is set and the rule priority.  All rules set by user space
applicatations fall into the IB_FLOW_DOMAIN_USER domain, other domains
could be used by future IPoIB RFS and Ethetool flow-steering interface
implementation.  Lower numerical value for the priority field means
higher priority.

The returned value from ib_create_flow() is a struct ib_flow, which
contains a database pointer (handle) provided by the HW driver to be
used when calling ib_destroy_flow().

Applications that offload TCP/IP traffic can also be written over IB
UD QPs.  The ib_create_flow() / ib_destroy_flow() API is designed to
support UD QPs too.  A HW driver can set IB_DEVICE_MANAGED_FLOW_STEERING
to denote support for flow steering.

The ib_flow_attr enum type supports usage of flow steering for promiscuous
and sniffer purposes:

    IB_FLOW_ATTR_NORMAL - "regular" rule, steering according to rule specification

    IB_FLOW_ATTR_ALL_DEFAULT - default unicast and multicast rule, receive
        all Ethernet traffic which isn't steered to any QP

    IB_FLOW_ATTR_MC_DEFAULT - same as IB_FLOW_ATTR_ALL_DEFAULT but only for multicast

    IB_FLOW_ATTR_SNIFFER - sniffer rule, receive all port traffic

ALL_DEFAULT and MC_DEFAULT rules options are valid only for Ethernet link type.

Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-28 09:51:52 -07:00
Or Gerlitz
6a06a4b8cf [SCSI] IB/iser: Add Discovery support
To run discovery over iSER we need to advertize the CAP_TEXT_NEGO capability
towards user space. Also need to make sure the login RX buffer is posted when
SendTargets TEXT PDUs are sent. For that end, we use a setting of the
ISCSI_PARAM_DISCOVERY_SESS iscsi param as an indication that this is
discovery session.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2013-08-26 18:53:49 +04:00
Joe Perches
8be04b9374 treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
Don't emit OOM warnings when k.alloc calls fail when
there there is a v.alloc immediately afterwards.

Converted a kmalloc/vmalloc with memset to kzalloc/vzalloc.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-08-20 13:06:40 +02:00
Jim Foraker
49b8e74438 IPoIB: Fix race in deleting ipoib_neigh entries
In several places, this snippet is used when removing neigh entries:

	list_del(&neigh->list);
	ipoib_neigh_free(neigh);

The list_del() removes neigh from the associated struct ipoib_path, while
ipoib_neigh_free() removes neigh from the device's neigh entry lookup
table.  Both of these operations are protected by the priv->lock
spinlock.  The table however is also protected via RCU, and so naturally
the lock is not held when doing reads.

This leads to a race condition, in which a thread may successfully look
up a neigh entry that has already been deleted from neigh->list.  Since
the previous deletion will have marked the entry with poison, a second
list_del() on the object will cause a panic:

  #5 [ffff8802338c3c70] general_protection at ffffffff815108c5
     [exception RIP: list_del+16]
     RIP: ffffffff81289020  RSP: ffff8802338c3d20  RFLAGS: 00010082
     RAX: dead000000200200  RBX: ffff880433e60c88  RCX: 0000000000009e6c
     RDX: 0000000000000246  RSI: ffff8806012ca298  RDI: ffff880433e60c88
     RBP: ffff8802338c3d30   R8: ffff8806012ca2e8   R9: 00000000ffffffff
     R10: 0000000000000001  R11: 0000000000000000  R12: ffff8804346b2020
     R13: ffff88032a3e7540  R14: ffff8804346b26e0  R15: 0000000000000246
     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
  #6 [ffff8802338c3d38] ipoib_cm_tx_handler at ffffffffa066fe0a [ib_ipoib]
  #7 [ffff8802338c3d98] cm_process_work at ffffffffa05149a7 [ib_cm]
  #8 [ffff8802338c3de8] cm_work_handler at ffffffffa05161aa [ib_cm]
  #9 [ffff8802338c3e38] worker_thread at ffffffff81090e10
 #10 [ffff8802338c3ee8] kthread at ffffffff81096c66
 #11 [ffff8802338c3f48] kernel_thread at ffffffff8100c0ca

We move the list_del() into ipoib_neigh_free(), so that deletion happens
only once, after the entry has been successfully removed from the lookup
table.  This same behavior is already used in ipoib_del_neighs_by_gid()
and __ipoib_reap_neigh().

Signed-off-by: Jim Foraker <foraker1@llnl.gov>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Jack Wang <jinpu.wang@profitbricks.com>
Reviewed-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-13 11:57:37 -07:00
Steve Wise
09992579bc RDMA/cxgb4: Issue RI.FINI before closing when entering TERM
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-13 11:55:49 -07:00
Steve Wise
a2de1499b3 RDMA/cxgb4: Advertise ~0ULL as max MR size
Lustre uses a advertised max MR size of ~0ULL to indicate it should
use a dma_mr.  Hence advertise max MR size as ~0ULL.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-13 11:55:48 -07:00
Steve Wise
b298881fcf RDMA/cxgb4: Always do GTS write if cidx_inc == CIDXINC_MASK
When polling, we do a GTS update if the accumulated cidx_inc == the CQ
depth / 16.  However, if the CQ is large enough, Cq depth / 16 exceeds
the size of the field in the GTS word.  So we also need to update if
cidx_inc hits CIDXINC_MASK to avoid overflowing the field.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-13 11:55:47 -07:00
Steve Wise
b38a0ad8ec RDMA/cxgb4: Set arp error handler for PASS_ACCEPT_RPL messages
accept_cr() failed to set the arp error handler on a reused skb.  This
results in a kernel crash if the arp does indeed time out.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-13 11:55:47 -07:00
Steve Wise
27ca34f54a RDMA/cxgb4: Fix accounting for unsignaled SQ WRs to deal with wrap
When determining how many WRs are completed with a signaled CQE,
correctly deal with queue wraps.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-13 11:55:46 -07:00
Steve Wise
1cf24dcef4 RDMA/cxgb4: Fix QP flush logic
This patch makes following fixes in QP flush logic:

- correctly flushes unsignaled WRs followed by a signaled WR
- supports for flushing a CQ bound to multiple QPs
- resets cidx_flush if a active queue starts getting HW CQEs again
- marks WQ in error when we leave RTS. This was only being done for
  user queues, but we need it for kernel queues too so that
  post_send/post_recv will start returning the appropriate error
  synchronously
- eats unsignaled read resp CQEs. HW always inserts CQEs so we must
  silently discard them if the read work request was unsignaled.
- handles QP flushes with pending SW CQEs. The flush and out of order
  completion logic has a bug where if out of order completions are
  flushed but not yet polled by the consumer and the qp is then
  flushed then we end up inserting duplicate completions.
- c4iw_flush_sq() should only flush wrs that have not already been
  flushed.  Since we already track where in the SQ we've flushed via
  sq.cidx_flush, just start at that point and flush any remaining.
  This bug only caused a problem in the presence of unsignaled work
  requests.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>

[ Fixed sparse warning due to htonl/ntohl confusion.  - Roland ]

Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-13 11:55:45 -07:00
Steve Wise
97d7ec0c41 RDMA/cxgb4: Handle newer firmware changes
Move QP to TERMINATE instead to allow the peer to get the TERM
message. This bug wasn't detectable until newer FW that moves
connections out of RDMA mode as soon as an error is detected.

QP can exit RTS before the last AE arrives.  This was introduced by
changes in the FW to kick connections out of RDMA mode as soon as an
error is detected.  A side effect of this is that the driver can move
the QP out of RTS before the AE causing the connection to get kicked
out of RDMA mode is processed.  Fix for this is to always post async
errors even if the QP is out of RTS.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-13 11:55:44 -07:00
Steve Wise
68074bb1ab RDMA/cxgb4: Use correct bit shift macros for vlan filter tuples
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-13 11:55:43 -07:00
Vipul Pandya
830662f6f0 RDMA/cxgb4: Add support for active and passive open connection with IPv6 address
Add new cpl messages, cpl_act_open_req6 and cpl_t5_act_open_req6, for
initiating active open connections.

Use LLD api cxgb4_create_server and cxgb4_create_server6 for
initiating passive open connections. Similarly use cxgb4_remove_server
to remove the passive open connections in place of listen_stop.

Add support for iWARP over VLAN device and enable IPv6 support on VLAN device.

Make use of import_ep in c4iw_reconnect.

Signed-off-by: Vipul Pandya <vipul@chelsio.com>

[ Fix build when IPv6 is disabled and make sure iw_cxgb4 is not built-in
  when ipv6 is a module.  - Roland ]

Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-13 11:55:06 -07:00
Yishai Hadas
846be90d81 IB/core: Fixes to XRC reference counting in uverbs
Added reference counting mechanism for XRC target QPs between
ib_uqp_object and its ib_uxrcd_object.  This prevents closing an XRC
domain that is still attached to a QP.  In addition, add missing code
in ib_uverbs_destroy_srq() to handle ib_uxrcd_object reference
counting correctly when destroying an xsrq.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-13 11:21:32 -07:00
Yishai Hadas
73c40c616a IB/core: Add locking around event dispatching on XRC target QPs
Fix a potential race when event occurrs on a target XRC QP and in the
middle of reporting that on its shared qps, one of them is destroyed
by user space application.  Also add note for kernel consumers in
ib_verbs.h that they must not destroy the QP from within the handler.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-13 11:21:32 -07:00
Yijing Wang
b29b076394 IB/qib: Clean up unnecessary MSI/MSI-X capability find
PCI core will initialize device MSI/MSI-X capability in
pci_msi_init_pci_dev().  So device drivers should use
pci_dev->msi_cap/msix_cap to determine whether a device supports
MSI/MSI-X instead of using pci_find_capability(pci_dev,
PCI_CAP_ID_MSI/MSIX).  Access to PCIe device config space again will
consume more time.

Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Acked-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-13 11:17:23 -07:00
Paul Bolle
bea25e82c6 IB/qib: Make qib_driver static
struct pci_driver qib_driver is only used in qib_init.c.  Remove it
from qib.h and make it static in qib_init.c.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-13 11:14:51 -07:00
CQ Tang
4668e4b527 IB/qib: Improve SDMA performance
1. The code accepts chunks of messages, and splits the chunk into
   packets when converting packets into sdma queue entries.  Adjacent
   packets will use user buffer pages smartly to avoid pinning the
   same page multiple times.

2. Instead of discarding all the work when SDMA queue is full, the
   work is saved in a pending queue.  Whenever there are enough SDMA
   queue free entries, pending queue is directly put onto SDMA queue.

3. An interrupt handler is used to progress this pending queue.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: CQ Tang <cq.tang@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>

[ Fixed up sparse warnings.  - Roland ]

Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-13 11:14:34 -07:00
Steve Wise
24d44a391f RDMA/cma: Add IPv6 support for iWARP
Modify the type of local_addr and remote_addr fields in struct
iw_cm_id from struct sockaddr_in to struct sockaddr_storage to hold
IPv6 and IPv4 addresses uniformly.

Change the references of local_addr and remote_addr in cxgb4, cxgb3,
nes and amso drivers to match this.  However to be able to actully run
traffic over IPv6, low-level drivers have to add code to support this.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>

[ Fix unused variable warnings when INFINIBAND_NES_DEBUG not set.
  - Roland ]

Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-12 12:32:31 -07:00
Naresh Gottumukkala
45e86b33ec RDMA/ocrdma: Cache recv DB until QP moved to RTR
1) In post recv, don't ring the DB doorbell if the QP is in RTR state.
   Cache the DB calls, until the QP is moved to RTS state.
2) Add max_rd_sge support to dev->attr.
3) Code cleanup in alloc_pd path.

Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-12 11:00:51 -07:00
Naresh Gottumukkala
7b9b1a596e RDMA/ocrdma: Remove __packed
1) Remove __packed for structures.
2) Align and pad all ABI structure to 64 bit boundaries
   instead of using __packed.

Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-12 10:59:44 -07:00
Naresh Gottumukkala
057729cb23 RDMA/ocrdma: Remove driver QP state machine
Remove QP state machine in ocrdma low-level driver and use on the core
IB stack's instead.

Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-12 10:58:38 -07:00
Naresh Gottumukkala
9c58726ba9 RDMA/ocrdma: Don't allow zero/invalid sgid usage
Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-12 10:58:38 -07:00
Naresh Gottumukkala
1afc0454b6 RDMA/ocrdma: Remove redundant dev reference
Remove redundant dev reference from structures:

1) ocrdma_cq.
2) ocrdma_ah.
3) ocrdma_hw_mr.
4) ocrdma_mw.
5) ocrdma_srq.

Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-12 10:58:38 -07:00
Naresh Gottumukkala
f99b1649db RDMA/ocrdma: Style and redundant code cleanup
Code cleanup and remove redundant code:

1) redundant initialization removed
2) braces changed as per CodingStyle.
3) redundant checks removed
4) extra braces in return statements removed.
5) removed unused pd pointer from mr.
6) reorganized get_dma_mr()
7) fixed set_av() to return error on invalid sgid index.
8) reference to ocrdma_dev removed from struct ocrdma_pd.

Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-12 10:58:37 -07:00
Sagi Grimberg
5587856c96 IB/iser: Introduce fast memory registration model (FRWR)
Newer HCAs and Virtual functions may not support FMRs but rather a fast
registration model, which we call FRWR - "Fast Registration Work Requests".

This model was introduced in 00f7ec36c ("RDMA/core: Add memory management
extensions support") and works when the IB device supports the
IB_DEVICE_MEM_MGT_EXTENSIONS capability.

Upon creating the iser device iser will test whether the HCA supports
FMRs.  If no support for FMRs, check if IB_DEVICE_MEM_MGT_EXTENSIONS
is supported and assign function pointers that handle fast
registration and allocation of appropriate resources (fast_reg
descriptors).

Registration is done using posting IB_WR_FAST_REG_MR to the QP and
invalidations using posting IB_WR_LOCAL_INV.

Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-09 17:18:10 -07:00
Sagi Grimberg
e657571b76 IB/iser: Place the fmr pool into a union in iser's IB conn struct
This is preparation step for other memory registration methods to be
added.  In addition, change reg/unreg routines signature to indicate
they use FMRs.

Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-09 17:18:09 -07:00
Sagi Grimberg
919fc274ef IB/iser: Handle unaligned SG in separate function
This routine will be shared with other rdma management schemes.  The
bounce buffer solution for unaligned SG-lists and the sg_to_page_vec
routine are likely to be used for other registration schemes and not
just FMR.

Move them out of the FMR specific code, and call them from there.
Later they will be called also from other reg methods code.

Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-09 17:18:09 -07:00
Sagi Grimberg
b4e155ffbb IB/iser: Generalize rdma memory registration
Currently the driver uses FMRs as the only means to register the
memory pointed by SG provided by the SCSI mid-layer with the RDMA
device.

As preparation step for adding more methods for fast path memory
registration, make the alloc/free and reg/unreg calls function
pointers, which are for now just set to the existing FMR ones.

Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-09 17:18:09 -07:00
Shlomo Pongratz
b7f0451309 IB/iser: Accept session->cmds_max from user space
Use cmds_max passed from user space to be the number of PDUs to be
supported for the session instead of hard-coded ISCSI_DEF_XMIT_CMDS_MAX.
This allow controlling the max number of SCSI commands for the session.
Also don't ignore the qdepth passed from user space.

Derive from session->cmds_max the actual number of RX buffers and FMR
pool size to allocate during the connection bind phase.

Since the iser transport connection is established before the iscsi
session/connection are created and bound, we still use one hard-coded
quantity ISER_DEF_XMIT_CMDS_MAX to compute the maximum number of
work-requests to be supported by the RC QP used for the connection.

The above quantity is made to be a power of two between ISCSI_TOTAL_CMDS_MIN
(16) and ISER_DEF_XMIT_CMDS_MAX (512) inclusive.

Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-09 17:18:08 -07:00
Shlomo Pongratz
986db0d6c0 IB/iser: Restructure allocation/deallocation of connection resources
This is a preparation step to a patch that accepts the number of max
SCSI commands to be supported a session from user space iSCSI tools.

Move the allocation of the login buffer, FMR pool and its associated
page vector from iser_create_ib_conn_res() (which is called prior when
we actually know how many commands should be supported) to
iser_alloc_rx_descriptors() (which is called during the iscsi
connection bind step where this quantity is known).

Also do small refactoring around the deallocation to make that path
similar to the allocation one.

Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-09 17:18:08 -07:00
Or Gerlitz
f91424cf5b IB/iser: Use proper debug level value for info prints
Commit 4f36388261 ("IB/iser: Move informational messages from error
to info level") set info prints to be emitted at a lower debug level
than warning prints, which is a bit odd.  Fix that.

Also move the prints on unaligned SG from warning to debug level.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-08-09 17:18:08 -07:00
Roland Dreier
569935db80 Merge branches 'cma', 'cxgb3', 'cxgb4', 'ipoib', 'misc', 'mlx4', 'mlx5', 'nes', 'ocrdma' and 'qib' into for-next 2013-07-31 14:24:06 -07:00
Erez Shitrit
c290414169 IPoIB: Fix pkey change flow for virtualization environments
IPoIB's required behaviour w.r.t to the pkey used by the device is the following:

- For "parent" interfaces (e.g ib0, ib1, etc) who are created
  automatically as a result of hot-plug events from the IB core, the
  driver needs to take whatever pkey vlaue it finds in index 0, and
  stick to that index.

- For child interfaces (e.g ib0.8001, etc) created by admin directive,
  the driver needs to use and stick to the value provided during its
  creation.

In SR-IOV environment its possible for the VF probe to take place
before the cloud management software provisions the suitable pkey for
the VF in the paravirtualed PKEY table index 0. When this is the case,
the VF IB stack will find in index 0 an invalide pkey, which is all
zeros.

Moreover, the cloud managment can assign the pkey value at index 0 at
any time of the guest life cycle.

The correct behavior for IPoIB to address these requirements for
parent interfaces is to use PKEY_CHANGE event as trigger to optionally
re-init the device pkey value and re-create all the relevant resources
accordingly, if the value of the pkey in index 0 has changed (from
invalid to valid or from valid value X to invalid value Y).

This patch enhances the heavy flushing code which is triggered by pkey
change event, to behave correctly for parent devices. For child
devices, the code remains the same, namely chases pkey value and not
index.

Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-07-31 14:23:44 -07:00
Or Gerlitz
3d790a4c26 IPoIB: Make sure child devices use valid/proper pkeys
Make sure that the IB invalid pkey (0x0000 or 0x8000) isn't used for
child devices.

Also, make sure to always set the full membership bit for the pkey of
devices created by rtnl link ops.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-07-31 14:23:40 -07:00
Jack Morgenstein
ef5ed4166f IB/core: Create QP1 using the pkey index which contains the default pkey
Currently, QP1 is created using pkey index 0. This patch simply looks
for the index containing the default pkey, rather than hard-coding
pkey index 0.

This change will have no effect in native mode, since QP0 and QP1 are
created before the SM configures the port, so pkey table will still be
the default table defined by the IB Spec, in C10-123: "If non-volatile
storage is not used to hold P_Key Table contents, then if a PM
(Partition Manager) is not present, and prior to PM initialization of
the P_Key Table, the P_Key Table must act as if it contains a single
valid entry, at P_Key_ix = 0, containing the default partition
key. All other entries in the P_Key Table must be invalid."

Thus, in the native mode case, the driver will find the default pkey
at index 0 (so it will be no different than the hard-coding).

However, in SR-IOV mode, for VFs, the pkey table may be
paravirtualized, so that the VF's pkey index zero may not necessarily
be mapped to the real pkey index 0. For VFs, therefore, it is
important to find the virtual index which maps to the real default
pkey.

This commit does the following for QP1 creation:

1. Find the pkey index containing the default pkey, and use that index
   if found.  ib_find_pkey() returns the index of the
   limited-membership default pkey (0x7FFF) if the full-member default
   pkey is not in the table.

2. If neither form of the default pkey is found, use pkey index 0
   (previous behavior).

Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-07-31 14:15:17 -07:00
Andi Shyti
618af3846b mlx5_core: Variable may be used uninitialized
In the sq_overhead() function, if qp_typ is equal to IB_QPT_RC, size
will be used uninitialized.

Signed-off-by: Andi Shyti <andi@etezian.org>
Acked-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-07-31 14:12:33 -07:00
Dan Carpenter
92b0ca7cb1 IB/mlx5: Fix stack info leak in mlx5_ib_alloc_ucontext()
We don't set "resp.reserved".  Since it's at the end of the struct
that means we don't have to copy it to the user.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-07-31 14:12:08 -07:00
Wei Yongjun
281d1a9211 IB/mlx5: Fix error return code in init_one()
Fix to return a negative error code from the error handling case
instead of 0, as done elsewhere in this function.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-07-31 14:12:07 -07:00