Support creating and destroying XRC domains. Any sharing of the XRCD
is managed above the low-level driver.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Currently, there is only a single ("basic") type of SRQ, but with XRC
support we will add a second. Prepare for this by defining an SRQ type
and setting all current users to IB_SRQT_BASIC.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Fix build errors:
drivers/infiniband/hw/ipath/ipath_init_chip.c:54:1: error: 'S_IRUGO' undeclared here (not in a function)
drivers/infiniband/hw/ipath/ipath_init_chip.c:54:1: error: bit-field '<anonymous>' width not an integer constant
drivers/infiniband/hw/ipath/ipath_init_chip.c:67:1: error: 'S_IWUSR' undeclared here (not in a function)
drivers/infiniband/hw/ipath/ipath_init_chip.c:67:1: error: bit-field '<anonymous>' width not an integer constant
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Support for Packed and Unaligned (PAU) FPDUs is needed for
interoperability between NES and non-NES nodes. When the NES hardware
detects a PAU frame, it will pass it to the driver to process the
frame. NES driver creates a new frame for each FPDU and forwards it
to the hardware to be sent to its associated qp.
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Faisal Latif <Faisal.Latif@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Print the IP address of the remote host when a critical asynchronous event is
received.
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Faisal Latif <Faisal.Latif@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Fixes a crash that occurs during close when error async event is received.
Terminate message is not sent to the remote node if already processing close.
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Faisal Latif <Faisal.Latif@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When I converted some drivers from pci_map_page to skb_frag_dma_map I
neglected to convert PCI_DMA_xDEVICE into DMA_x_DEVICE and
pci_dma_mapping_error into dma_mapping_error.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for Enhanced RDMA Connection Establishment
(draft-ietf-storm-mpa-peer-connect-06), aka MPAv2. Details of draft
can be obtained from:
<http://www.ietf.org/id/draft-ietf-storm-mpa-peer-connect-06.txt>
For backwards compatibility, the MPAv2 enabled driver reverts to MPAv1
if the remote node doesn't support MPAv2.
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com>
Signed-off-by: Faisal Latif <Faisal.Latif@intel.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch adds support for Enhanced RDMA Connection Establishment
(draft-ietf-storm-mpa-peer-connect-06), aka MPAv2. Details of draft
can be obtained from:
<http://www.ietf.org/id/draft-ietf-storm-mpa-peer-connect-06.txt>
The patch updates the following functions for initiator perspective:
- send_mpa_request
- process_mpa_reply
- post_terminate for TERM error codes
- destroy_qp for TERM related change
- adds layer/etype/ecode to c4iw_qp_attrs for sending with TERM
- peer_abort for retrying connection attempt with MPA_v1 message
- added c4iw_reconnect function
The patch updates the following functions for responder perspective:
- process_mpa_request
- send_mpa_reply
- c4iw_accept_cr
- passes ird/ord to upper layers
Signed-off-by: Kumar Sanghvi <kumaras@chelsio.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
As part of MPAv2 Enhanced RDMA Negotiation, pass max supported ird/ord
values upwards for the time being in iw_cxgb3 and amso1100.
Signed-off-by: Kumar Sanghvi <kumaras@chelsio.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The code that was recently introduced to report the number
of free contexts is flawed for multiple HCAs:
/* Return the number of free user ports (contexts) available. */
return scnprintf(buf, PAGE_SIZE, "%u\n", dd->cfgctxts -
dd->first_user_ctxt - (u32)qib_stats.sps_ctxts);
The qib_stats is global to the module, not per HCA, so the code is broken
for multiple HCAs.
This patch adds a qib_devdata field, freectxts, that reflects the free
contexts for this HCA.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Reviewed-by: Ram Vepa <ram.vepa@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add calls to ib_umem_release(), as in the other error-handling code in
nes_reg_user_mr().
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Since commit e58aa3d2d0 ("genirq: Run irq handlers with interrupts
disabled"), we run all interrupt handlers with interrupts disabled and
we even check and yell when an interrupt handler returns with
interrupts enabled -- cf commit b738a50a20 ("genirq: Warn when
handler enables interrupts").
So now this flag is a no-op and can be removed.
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Acked-by: Hoang-Nam Nguyen <hnguyen@de.ibm.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The iw_cxgb4 module crashes at init time if the T4 card does not
support RDMA. So clean up the init logic to correctly deal with
non-RDMA cards.
- If any RDMA resources are not available, then fail the initialization
logging an info message.
- Clean up properly on initialization failures.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This oops was reported recently:
d:mon> e
cpu 0xd: Vector: 300 (Data Access) at [c0000000fd4c7120]
pc: d00000000076f194: .t3_l2t_get+0x44/0x524 [cxgb3]
lr: d000000000b02108: .init_act_open+0x150/0x3d4 [cxgb3i]
sp: c0000000fd4c73a0
msr: 8000000000009032
dar: 0
dsisr: 40000000
current = 0xc0000000fd640d40
paca = 0xc00000000054ff80
pid = 5085, comm = iscsid
d:mon> t
[c0000000fd4c7450] d000000000b02108 .init_act_open+0x150/0x3d4 [cxgb3i]
[c0000000fd4c7500] d000000000e45378 .cxgbi_ep_connect+0x784/0x8e8 [libcxgbi]
[c0000000fd4c7650] d000000000db33f0 .iscsi_if_rx+0x71c/0xb18
[scsi_transport_iscsi2]
[c0000000fd4c7740] c000000000370c9c .netlink_data_ready+0x40/0xa4
[c0000000fd4c77c0] c00000000036f010 .netlink_sendskb+0x4c/0x9c
[c0000000fd4c7850] c000000000370c18 .netlink_sendmsg+0x358/0x39c
[c0000000fd4c7950] c00000000033be24 .sock_sendmsg+0x114/0x1b8
[c0000000fd4c7b50] c00000000033d208 .sys_sendmsg+0x218/0x2ac
[c0000000fd4c7d70] c00000000033f55c .sys_socketcall+0x228/0x27c
[c0000000fd4c7e30] c0000000000086a4 syscall_exit+0x0/0x40
--- Exception: c01 (System Call) at 00000080da560cfc
The root cause was an EEH error, which sent us down the offload_close path in
the cxgb3 driver, which in turn sets cdev->l2opt to NULL, without regard for
upper layer driver (like the cxgbi drivers) which might have execution contexts
in the middle of its use. The result is the oops above, when t3_l2t_get attempts
to dereference L2DATA(cdev)->nentries in arp_hash right after the EEH error handler sets it to NULL.
The fix is to prevent the setting of the NULL pointer until after there are no
further users of it. The t3cdev->l2opt pointer is now converted to be an rcu
pointer and the L2DATA macro is now called under the protection of the
rcu_read_lock(). When the EEH error path:
t3_adapter_error->offload_close->cxgb3_offload_deactivate
Is exectured, setting of that l2opt pointer to NULL, is now gated on an rcu
quiescence point, preventing, allowing L2DATA callers to safely check for a NULL
pointer without concern that the underlying data will be freeded before the
pointer is dereferenced.
This has been tested by the reporter and shown to fix the reproted oops
[nhorman: fix up unitinialised variable reported by Dan Carpenter]
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Karen Xie <kxie@chelsio.com>
Cc: stable@kernel.org
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Faisal Latif <faisal.latif@intel.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: linux-rdma@vger.kernel.org
Cc: netdev@vger.kernel.org
Acked-by: Faisal Latif <faisal.latif@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tom Tucker <tom@opengridcomputing.com>
Cc: Steve Wise <swise@opengridcomputing.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: linux-rdma@vger.kernel.org
Cc: netdev@vger.kernel.org
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Moves the drivers for the Chelsio chipsets into
drivers/net/ethernet/chelsio/ and the necessary Kconfig and Makefile
changes.
CC: Divy Le Ray <divy@chelsio.com>
CC: Dimitris Michailidis <dm@chelsio.com>
CC: Casey Leedom <leedom@chelsio.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>
Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (26 commits)
IB/qib: Defer HCA error events to tasklet
mlx4_core: Bump the driver version to 1.0
RDMA/cxgb4: Use printk_ratelimited() instead of printk_ratelimit()
IB/mlx4: Support PMA counters for IBoE
IB/mlx4: Use flow counters on IBoE ports
IB/pma: Add include file for IBA performance counters definitions
mlx4_core: Add network flow counters
mlx4_core: Fix location of counter index in QP context struct
mlx4_core: Read extended capabilities into the flags field
mlx4_core: Extend capability flags to 64 bits
IB/mlx4: Generate GID change events in IBoE code
IB/core: Add GID change event
RDMA/cma: Don't allow IPoIB port space for IBoE
RDMA: Allow for NULL .modify_device() and .modify_port() methods
IB/qib: Update active link width
IB/qib: Fix potential deadlock with link down interrupt
IB/qib: Add sysfs interface to read free contexts
IB/mthca: Remove unnecessary read of PCI_CAP_ID_EXP
IB/qib: Remove double define
IB/qib: Remove unnecessary read of PCI_CAP_ID_EXP
...
With ib_qib options:
options ib_qib krcvqs=1 pcie_caps=0x51 rcvhdrcnt=4096 singleport=1 ibmtu=4
a run of ib_write_bw -a yields the following:
------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec]
1048576 5000 2910.64 229.80
------------------------------------------------------------------
The top cpu use in a profile is:
CPU: Intel Architectural Perfmon, speed 2400.15 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask
of 0x00 (No unit mask) count 1002300
Counted LLC_MISSES events (Last level cache demand requests from this core that
missed the LLC) with a unit mask of 0x41 (No unit mask) count 10000
samples % samples % app name symbol name
15237 29.2642 964 17.1195 ib_qib.ko qib_7322intr
12320 23.6618 1040 18.4692 ib_qib.ko handle_7322_errors
4106 7.8860 0 0 vmlinux vsnprintf
Analysis of the stats, profile, the code, and the annotated profile indicate:
- All of the overflow interrupts (one per packet overflow) are
serviced on CPU0 with no mitigation on the frequency.
- All of the receive interrupts are being serviced by CPU0. (That is
the way truescale.cmds statically allocates the kctx IRQs to CPU)
- The code is spending all of its time servicing QIB_I_C_ERROR
RcvEgrFullErr interrupts on CPU0, starving the packet receive
processing.
- The decode_err routine is very inefficient, using a printf variant
to format a "%s" and continues to loop when the errs mask has been
cleared.
- Both qib_7322intr and handle_7322_errors read pci registers, which
is very inefficient.
The fix does the following:
- Adds a tasklet to service QIB_I_C_ERROR
- Replaces the very inefficient scnprintf() with a memcpy(). A field
is added to qib_hwerror_msgs to save the sizeof("string") at
compile time so that a strlen is not needed during err_decode().
- The most frequent errors (Overflows) are serviced first to exit the
loop as early as possible.
- The loop now exits as soon as the errs mask is clear rather than
fruitlessly looping through the msp array.
With this fix the performance changes to:
------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec]
1048576 5000 2990.64 2941.35
------------------------------------------------------------------
During testing of the error handling overflow patch, it was determined
that some CPU's were slower when servicing both overflow and receive
interrupts on CPU0 with different MSI interrupt vectors.
This patch adds an option (krcvq01_no_msi) to not use a dedicated MSI
interrupt for kctx's < 2 and to service them on the default interrupt.
For some CPUs, the cost of the interrupt enter/exit is more costly
than then the additional PCI read in the default handler.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
- unify vlan and nonvlan rx path
- kill nesvnic->vlan_grp and nes_netdev_vlan_rx_register
- allow to turn on/off rx/tx vlan accel via ethtool (set_features)
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since printk_ratelimit() shouldn't be used anymore (see comment in
include/linux/printk.h), replace it with printk_ratelimited().
Signed-off-by: Manuel Zerpies <manuel.f.zerpies@ww.stud.uni-erlangen.de>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Use the per port counter attached to all QPs created on that port to
implement port level packets/bytes performance counters a la IB.
Derived from a patch by Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Allocate flow counter per Ethernet/IBoE port, and attach this counter
to all the QPs created on that port. Based on patch by Eli Cohen
<eli@mellanox.co.il>.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Move the various definitions and mad structures needed for software
implementation of IBA PM agent from the ipath and qib drivers into a
single include file, which in turn could be used by more consumers.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
IBoE doesn't use LIDs. Use the GID change event to update the IB core
cache for addition/deletion of GIDs.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
These methods don't make sense for iWARP devices, so rather than
forcing them to implement stubs, just return -ENOSYS in the core if
the hardware driver doesn't set .modify_device and/or .modify_port.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Update the active link width on QLE7220 chips when link goes down if
chip width does not match shadowed width.
Signed-off-by: Mitko Haralanov <mitko@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
There is a possibility of a deadlock due to the way locks are
acquired and released in qib_set_uevent_bits(). The function
qib_set_uevent_bits() is called in process context and it uses
spin_lock() and spin_unlock(). This same lock is acquired/released
in interrupt context which can lead to a deadlock when running on
the same cpu.
The fix is to replace spin_lock() and spin_unlock() with
spin_lock_irqsave() and spin_unlock_irqrestore() respectively in
qib_set_uevent_bits().
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Indicate the number of free user contexts via the sysfs file
/sys/class/infiniband/qib0/nfreectxts as required for PSM.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The PCIE capability offset is saved during PCI bus walking. It will
remove an unnecessary search in the PCI configuration space if this
value is referenced instead of reacquiring it. Also, pci_is_pcie is a
better way of determining if the device is PCIE or not (as it uses the
same saved PCIE capability offset).
Signed-off-by: Jon Mason <jdmason@kudzu.us>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The PCIE capability offset is saved during PCI bus walking. It will
remove an unnecessary search in the PCI configuration space if this
value is referenced instead of reacquiring it.
Signed-off-by: Jon Mason <jdmason@kudzu.us>
Acked-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Adapt to new api. We plan to remove old one later. Almost all
changes are trivial, but there is one real fix: the following code is
unsafe:
int ncpus = num_online_cpus()
for (i = 0; i < ncpus; i++) {
..
}
because 1) we don't guarantee last bit of online cpus is equal to
num_online_cpus(). some arch assign sparse cpu number. 2) cpu
hotplugging may change cpu_online_mask at same time. we need to pin
it by get_online_cpus().
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Adapt to use new APIs. We plan to remove old one later and plan to
change current->cpus_allowed implementation.
No functional change.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Due to timing, it is possible for the LOS and DFE to remain on. This
is due to the link progressing to LinkUP prior to the driver getting
the first Status Changed interrupt. By expanding the conditions under
which LOS is turned off and DFE timeout is being set, timing is no
longer an issue.
Signed-off-by: Mitko Haralanov <mitko@qlogic.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
- fix a race where the driver could end up sending a close_con_req
after an abort_rpl. In c4iw_ep_disconnect(), send abort or close
request with the ep mutex held.
- fix a hang where driver fails to wake up when a connection is reset
during a normal close. Wake up any waiters in the interrupt path,
and correctly cleanup after rdma_fini() failures.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Remove left-over code from T3 that limited MR sizes to 32b.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Memory allocated for user CQs gets rounded up to the next page
boundary. And after rounding, we recalculate the resulting IQ depth
and we need to make sure we don't exceed the HW limits.
This bug can result a much smaller CQ allocated than was expected if
the HW size field is exceeded, resulting in CQ overflow failures.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
* remove interrupt.g inclusion from netdevice.h -- not needed
* fixup fallout, add interrupt.h and hardirq.h back where needed.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>