Running 'ifconfig up' on the cxgb3 interface with iw_cxgb3 loaded
causes a deadlock. The rtnl lock is already held in this path. The
function fw_supports_fastreg() was introduced in 2.6.27 to
conditionally set the IB_DEVICE_MEM_MGT_EXTENSIONS bit iff the
firmware was at 7.0 or greater, and this function also acquires the
rtnl lock and which thus causes a deadlock. Further, if iw_cxgb3 is
loaded _after_ the nic interface is brought up, then the deadlock does
not occur and therefore fw_supports_fastreg() does need to grab the
rtnl lock in that path.
It turns out this code is all useless anyway. The low level driver
will NOT allow the open if the firmware isn't 7.0, so iw_cxgb3 can
always set the MEM_MGT_EXTENSIONS bit. Simplify...
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- MWs don't have local read/write permissions.
- Set the MW_BIND enabled bit if a MR has MW_BIND access.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- Set the stag0 and fastreg capability bits only for kernel qps.
- QP_PRIV flag is no longer used, so don't set it.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
from include/asm-powerpc. This is the result of a
mkdir arch/powerpc/include/asm
git mv include/asm-powerpc/* arch/powerpc/include/asm
Followed by a few documentation/comment fixups and a couple of places
where <asm-powepc/...> was being used explicitly. Of the latter only
one was outside the arch code and it is a driver only built for powerpc.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Some module parameters with only one line have the '\n' at the end of the
description. This is not needed nor wanted as after the description the
type (i.e. int) is followed by a newline.
Some modules contain a multi-line description, these are not affected
by this patch.
Signed-off-by: Niels de Vos <niels.devos@wincor-nixdorf.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: John W. Linville <linville@tuxdriver.com>
Cc: Ed L. Cashin <ecashin@coraid.com>
Cc: Dave Airlie <airlied@linux.ie>
Cc: Roland Dreier <rolandd@cisco.com>
Acked-by: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
mlx4: Update/add Mellanox Technologies copyright lines to mlx4 driver files
mlx4_core: Add VLAN tag field to WQE control segment struct
RDMA/nes: CM connection setup/teardown rework
IPoIB: Correct help text for INFINIBAND_IPOIB_DEBUG
IPoIB/cm: Connected mode is no longer EXPERIMENTAL
RDMA/ucm: BKL is not needed for ib_ucm_open()
RDMA/ucma: BKL is not needed for ucma_open()
Add per-device dma_mapping_ops support for CONFIG_X86_64 as POWER
architecture does:
This enables us to cleanly fix the Calgary IOMMU issue that some devices
are not behind the IOMMU (http://lkml.org/lkml/2008/5/8/423).
I think that per-device dma_mapping_ops support would be also helpful for
KVM people to support PCI passthrough but Andi thinks that this makes it
difficult to support the PCI passthrough (see the above thread). So I
CC'ed this to KVM camp. Comments are appreciated.
A pointer to dma_mapping_ops to struct dev_archdata is added. If the
pointer is non NULL, DMA operations in asm/dma-mapping.h use it. If it's
NULL, the system-wide dma_ops pointer is used as before.
If it's useful for KVM people, I plan to implement a mechanism to register
a hook called when a new pci (or dma capable) device is created (it works
with hot plugging). It enables IOMMUs to set up an appropriate
dma_mapping_ops per device.
The major obstacle is that dma_mapping_error doesn't take a pointer to the
device unlike other DMA operations. So x86 can't have dma_mapping_ops per
device. Note all the POWER IOMMUs use the same dma_mapping_error function
so this is not a problem for POWER but x86 IOMMUs use different
dma_mapping_error functions.
The first patch adds the device argument to dma_mapping_error. The patch
is trivial but large since it touches lots of drivers and dma-mapping.h in
all the architecture.
This patch:
dma_mapping_error() doesn't take a pointer to the device unlike other DMA
operations. So we can't have dma_mapping_ops per device.
Note that POWER already has dma_mapping_ops per device but all the POWER
IOMMUs use the same dma_mapping_error function. x86 IOMMUs use device
argument.
[akpm@linux-foundation.org: fix sge]
[akpm@linux-foundation.org: fix svc_rdma]
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix bnx2x]
[akpm@linux-foundation.org: fix s2io]
[akpm@linux-foundation.org: fix pasemi_mac]
[akpm@linux-foundation.org: fix sdhci]
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix sparc]
[akpm@linux-foundation.org: fix ibmvscsi]
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Muli Ben-Yehuda <muli@il.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update existing Mellanox copyright lines to 2008, and add such lines
to files where they are missing.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Major rework of CM connection setup/teardown. We had a number of issues
with MPI applications not starting/terminating properly over time.
With these changes we were able to run longer on larger clusters.
* Remove memory allocation from nes_connect() and nes_cm_connect().
* Fix mini_cm_dec_refcnt_listen() when destroying listener.
* Remove unnecessary code from schedule_nes_timer() and nes_cm_timer_tick().
* Functionalize mini_cm_recv_pkt() and process_packet().
* Clean up cm_node->ref_count usage.
* Reuse skbs if available.
Signed-off-by: Faisal Latif <flatif@neteffect.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
MAINTAINERS: Remove Glenn Streiff from NetEffect entry
mlx4_core: Improve error message when not enough UAR pages are available
IB/mlx4: Add support for memory management extensions and local DMA L_Key
IB/mthca: Keep free count for MTT buddy allocator
mlx4_core: Keep free count for MTT buddy allocator
mlx4_code: Add missing FW status return code
IB/mlx4: Rename struct mlx4_lso_seg to mlx4_wqe_lso_seg
mlx4_core: Add module parameter to enable QoS support
RDMA/iwcm: Remove IB_ACCESS_LOCAL_WRITE from remote QP attributes
IPoIB: Include err code in trace message for ib_sa_path_rec_get() failures
IB/sa_query: Check if sm_ah is NULL in ib_sa_remove_one()
IB/ehca: Release mutex in error path of alloc_small_queue_page()
IB/ehca: Use default value for Local CA ACK Delay if FW returns 0
IB/ehca: Filter PATH_MIG events if QP was never armed
IB/iser: Add support for RDMA_CM_EVENT_ADDR_CHANGE event
RDMA/cma: Add RDMA_CM_EVENT_TIMEWAIT_EXIT event
RDMA/cma: Add RDMA_CM_EVENT_ADDR_CHANGE event
* 'cpus4096-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits)
NR_CPUS: Replace NR_CPUS in speedstep-centrino.c
cpumask: Provide a generic set of CPUMASK_ALLOC macros, FIXUP
NR_CPUS: Replace NR_CPUS in cpufreq userspace routines
NR_CPUS: Replace per_cpu(..., smp_processor_id()) with __get_cpu_var
NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genapic_flat_64.c
NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genx2apic_uv_x.c
NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/proc.c
NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/mcheck/mce_64.c
cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c, fix
cpumask: Use optimized CPUMASK_ALLOC macros in the centrino_target
cpumask: Provide a generic set of CPUMASK_ALLOC macros
cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c
cpumask: Optimize cpumask_of_cpu in kernel/time/tick-common.c
cpumask: Optimize cpumask_of_cpu in drivers/misc/sgi-xp/xpc_main.c
cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/ldt.c
cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/io_apic_64.c
cpumask: Replace cpumask_of_cpu with cpumask_of_cpu_ptr
Revert "cpumask: introduce new APIs"
cpumask: make for_each_cpu_mask a bit smaller
net: Pass reference to cpumask variable in net/sunrpc/svc.c
...
Fix up trivial conflicts in drivers/cpufreq/cpufreq.c manually
Add support for the following operations to mlx4 when device firmware
supports them:
- Send with invalidate and local invalidate send queue work requests;
- Allocate/free fast register MRs;
- Allocate/free fast register MR page lists;
- Fast register MR send queue work requests;
- Local DMA L_Key.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
MTT entries are allocated with a buddy allocator, which just keeps
bitmaps for each level of the buddy table. However, all free space
starts out at the highest order, and small allocations start scanning
from the lowest order. When the lowest order tables have no free
space, this can lead to scanning potentially millions of bits before
finding a free entry at a higher order.
We can avoid this by just keeping a count of how many free entries
each order has, and skipping the bitmap scan when an order is
completely empty. This provides a nice performance boost for a
negligible increase in memory usage.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The pd->lock mutex is released on a successful return, so it should be
released on an error return as well.
The semantic patch that makes this change is as follows:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@@
expression l;
@@
mutex_lock(l);
... when != mutex_unlock(l)
when any
when strict
(
if (...) { ... when != mutex_unlock(l)
+ mutex_unlock(l);
return ...;
}
|
mutex_unlock(l);
)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Some firmware versions report a Local CA ACK Delay of 0. In that
case, return a more sensible default value of 12 (-> 16 msec) instead.
Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Certain firmware versions sometimes cause spurious PATH_MIG events to
occur during QP creation. Filter these events by making sure PATH_MIG
events are only handed down when they actually make sense (i.e. when
the QP has been armed at least once).
Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
device_create() is race-prone, so use the race-free
device_create_drvdata() instead as device_create() is going away.
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Current code uses kmalloc() and then just does a bitwise OR operation on
qp->flags in create_qp_common(), which means that qp->flags may
potentially have some unintended bits set. This patch uses kzalloc()
and avoids further explicit clearing of structure members, which also
shrinks the code:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-65 (-65)
function old new delta
create_qp_common 2024 1959 -65
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Handling the zero STag in receive work request requires some extra
logic in the driver:
- Only set the QP_PRIV bit for kernel mode QPs.
- Add a zero STag build function for recv wrs. The uP needs a PBL
allocated and passed down in the recv WR so it can construct a HW
PBL for the zero STag S/G entries. Note: we need to place a few
restrictions on zero STag usage because of this:
1) all SGEs in a recv WR must either be zero STag or not. No mixing.
2) an individual SGE length cannot exceed 128MB for a zero-stag SGE.
This should be OK since it's not really practical to allocate
such a large chunk of pinned contiguous DMA mapped memory.
- Add an optimized non-zero-STag recv wr format for kernel users.
This is needed to optimize both zero and non-zero STag cracking in
the recv path for kernel users.
- Remove the iwch_ prefix from the static build functions.
- Bump required FW version.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
- Change the IB_DEVICE_ZERO_STAG flag to the transport-neutral name
IB_DEVICE_LOCAL_DMA_LKEY, which is used by iWARP RNICs to indicate 0
STag support and IB HCAs to indicate reserved L_Key support.
- Add a u32 local_dma_lkey member to struct ib_device. Drivers fill
this in with the appropriate local DMA L_Key (if they support it).
- Fix up the drivers using this flag.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The MLX transport requires two extra gather entries for sends (one for
the header and one for the checksum at the end, as the comment says).
However the code checked that max_recv_sge was not too big, instead of
checking max_send_sge as it should have. Fix the code to check the
correct condition.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Exactly when the catastrophic error polling timer function runs is not
important, so use round_jiffies() to save unnecessary wakeups.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Since we use del_timer_sync() anyway, there's no need for an
additional flag to tell the timer not to rearm.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The IB spe. for SubnGet(NodeInfo) and query HCA says that the vendor
ID field should be the IEEE OUI assigned to the vendor. The ipath
driver was returning the PCI vendor ID instead. This will affect
applications which call ibv_query_device(). The old value was
0x001fc1 or 0x001077, the new value is 0x001175.
The vendor ID doesn't appear to be exported via /sys so that should
reduce possible compatibility issues. I'm only aware of Open MPI as a
major application which depends on this change, and they have made
necessary adjustments.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Every caller of nes_post_cqp_request() passed it NES_CQP_REQUEST_RING_DOORBELL,
so just remove that parameter and always ring the doorbell.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Acked-by: Faisal Latif <flatif@neteffect.com>
cxgb3 does not currently report the page size capabilities, and
incorrectly reports them internally.
This version changes the bit-shifting to a static value (per Steve's
request).
Signed-off-by: Jon Mason <jon@opengridcomputing.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This gives ehca an autogenerated modalias and therefore enables automatic loading.
Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for handling the IB_QP_CREATE_MULTICAST_BLOCK_LOOPBACK
flag by using the per-multicast group loopback blocking feature of
mlx4 hardware.
Signed-off-by: Ron Livne <ronli@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- Add a new rdma ctl command called RDMA_GET_MIB to the cxgb3 low
level driver to obtain the protocol mib from the rnic hardware.
- Add new iw_cxgb3 provider method to get the MIB from the low level
driver.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The members struct iwch_rnic_attributes.vendor_id and .vendor_part_id
are write-only, so we might as well get rid of them.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
- set fw_ver
- set hw_ver
- set max_qp_wr to something reasonable
- set max_cqe to something reasonable
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
During corner case testing, we noticed that some versions of ehca do
not properly transition to interrupt done in special load situations.
This can be resolved by periodically triggering EOI through H_EOI, if
EQEs are pending.
Signed-off-by: Stefan Roscher <stefan.roscher@de.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commit 65adfa91 ("IB/mlx4: Fix RESET to RESET and RESET to ERROR
transitions") added some extra code to handle a QP state transition
from RESET to ERROR. However, the latest 1.2.1 version of the IB spec
has clarified that this transition is actually not allowed, so we can
remove this extra code again.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commit b18aad71 ("IB/mthca: Fix RESET to ERROR transition") added some
extra code to handle a QP state transition from RESET to ERROR.
However, the latest 1.2.1 version of the IB spec has clarified that
this transition is actually not allowed, so we can remove this extra
code again.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ConnectX HCAs support the IB_MGMT_CLASS_CONG_MGMT management class, so
process MADs of this class through the MAD_IFC firmware command.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ConnectX returns the max message size it supports through the
QUERY_DEV_CAP firmware command. When modifying a QP to RTR, the max
message size for the QP must be specified. This value must not exceed
the value declared through QUERY_DEV_CAP. The current code ignores
the max allowed size and unconditionally sets the value to 2^31. This
patch sets all QPs to the max value allowed as returned from firmware.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- set IB_DEVICE_MEM_MGT_EXTENSIONS capability bit if fw supports it.
- set max_fast_reg_page_list_len device attribute.
- add iwch_alloc_fast_reg_mr function.
- add iwch_alloc_fastreg_pbl
- add iwch_free_fastreg_pbl
- adjust the WQ depth for kernel mode work queues to account for
fastreg possibly taking 2 WR slots.
- add fastreg_mr work request support.
- add local_inv work request support.
- add send_with_inv and send_with_se_inv work request support.
- removed useless duplicate enums/defines for TPT/MW/MR stuff.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch adds support for the IB "base memory management extension"
(BMME) and the equivalent iWARP operations (which the iWARP verbs
mandates all devices must implement). The new operations are:
- Allocate an ib_mr for use in fast register work requests.
- Allocate/free a physical buffer lists for use in fast register work
requests. This allows device drivers to allocate this memory as
needed for use in posting send requests (eg via dma_alloc_coherent).
- New send queue work requests:
* send with remote invalidate
* fast register memory region
* local invalidate memory region
* RDMA read with invalidate local memory region (iWARP only)
Consumer interface details:
- A new device capability flag IB_DEVICE_MEM_MGT_EXTENSIONS is added
to indicate device support for these features.
- New send work request opcodes IB_WR_FAST_REG_MR, IB_WR_LOCAL_INV,
IB_WR_RDMA_READ_WITH_INV are added.
- A new consumer API function, ib_alloc_mr() is added to allocate
fast register memory regions.
- New consumer API functions, ib_alloc_fast_reg_page_list() and
ib_free_fast_reg_page_list() are added to allocate and free
device-specific memory for fast registration page lists.
- A new consumer API function, ib_update_fast_reg_key(), is added to
allow the key portion of the R_Key and L_Key of a fast registration
MR to be updated. Consumers call this if desired before posting
a IB_WR_FAST_REG_MR work request.
Consumers can use this as follows:
- MR is allocated with ib_alloc_mr().
- Page list memory is allocated with ib_alloc_fast_reg_page_list().
- MR R_Key/L_Key "key" field is updated with ib_update_fast_reg_key().
- MR made VALID and bound to a specific page list via
ib_post_send(IB_WR_FAST_REG_MR)
- MR made INVALID via ib_post_send(IB_WR_LOCAL_INV),
ib_post_send(IB_WR_RDMA_READ_WITH_INV) or an incoming send with
invalidate operation.
- MR is deallocated with ib_dereg_mr()
- page lists dealloced via ib_free_fast_reg_page_list().
Applications can allocate a fast register MR once, and then can
repeatedly bind the MR to different physical block lists (PBLs) via
posting work requests to a send queue (SQ). For each outstanding
MR-to-PBL binding in the SQ pipe, a fast_reg_page_list needs to be
allocated (the fast_reg_page_list is owned by the low-level driver
from the consumer posting a work request until the request completes).
Thus pipelining can be achieved while still allowing device-specific
page_list processing.
The 32-bit fast register memory key/STag is composed of a 24-bit index
and an 8-bit key. The application can change the key each time it
fast registers thus allowing more control over the peer's use of the
key/STag (ie it can effectively be changed each time the rkey is
rebound to a page list).
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The idea is that for QPs with fixed size work requests (eg selective
signaling QPs), before stamping the WQE, we read the value of the DS
field, which gives the effective size of the descriptor as used in the
previous post. Then we stamp only that area, since the rest of the
descriptor is already stamped.
When initializing the send queue buffer, make sure the DS field is
initialized to the max descriptor size so that the subsequent stamping
will be done on the entire descriptor area.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Remove an explicit memset(..., 0, ...) of a 'listener' structure
allocated with kzalloc().
Signed-off-by: Christophe Jaillet <christophe.jaillet@wanadoo.fr>
Acked-by: Faisal Latif <faisal@neteffect.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The change to iwch_provider.c in commit f4e91eb4 ("IB: convert struct
class_device to struct device") undid the fix done in commit 7f049f2f
("RDMA/cxgb3: Hold rtnl_lock() around ethtool get_drvinfo call"). It
removed the calls to rtnl_lock() that serialized the iw_cxgb3 ethtool
ops calls into the cxgb3 driver. This locking is needed to avoid
messing up the internal state of the cxgb3 driver.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Current memfree FW has a bug which in some cases, assumes that ICM
pages passed to it are cleared. This patch uses __GFP_ZERO to
allocate all ICM pages passed to the FW. Once firmware with a fix is
released, we can make the workaround conditional on firmware version.
This fixes the bug reported by Arthur Kepner <akepner@sgi.com> here:
http://lists.openfabrics.org/pipermail/general/2008-May/050026.html
Cc: <stable@kernel.org>
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
[ Rewritten to be a one-liner using __GFP_ZERO instead of vmap()ing
ICM memory and memset()ing it to 0. - Roland ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
All of the open() functions which don't need the BKL on their face may
still depend on its acquisition to serialize opens against driver
initialization. So make those functions acquire then release the BKL to be
on the safe side.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
This documents the fact that somebody looked at the relevant open()
functions and concluded that, due to their trivial nature, no locking was
needed.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
nes_reg_user_mr() should fail if page_count becomes >= 1024 * 512
rather than just testing for strict >, because page_count is
essentially used as an index into an array with 1024 * 512 entries, so
allowing the loop to continue with page_count == 1024 * 512 means that
memory after the end of the array is corrupted. This leads to a crash
triggerable by a userspace application that requests registration of a
too-big region.
Also get rid of the call to pci_free_consistent() here to avoid
corrupting state with a double free, since the same memory will be
freed in the code jumped to at reg_user_mr_err.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In 2.6.26, we added some support for send with invalidate work
requests, including a device capability flag to indicate whether a
device supports such requests. However, the support was incomplete:
the completion structure was not extended with a field for the key
contained in incoming send with invalidate requests.
Full support for memory management extensions (send with invalidate,
local invalidate, fast register through a send queue, etc) is planned
for 2.6.27. Since send with invalidate is not very useful by itself,
just remove the IB_DEVICE_SEND_W_INV bit before the 2.6.26 final
release; we will add an IB_DEVICE_MEM_MGT_EXTENSIONS bit in 2.6.27,
which makes things simpler for applications, since they will not have
quite as confusing an array of fine-grained bits to check.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
SM/SMA traps received by the ipath driver should be forwarded to the
SM if it is running on the host. The ib_ipath driver was incorrectly
replying with "bad method."
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The driver supports a few features (RNR NAK, port active event, SRQ
resize) that were not reported in the device capability flags. This
patch fixes that.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Gabriel C <nix.or.die@googlemail.com> pointed out that when the x86
bitops are updated to operate on unsigned long, the code in
sdma_abort_task() will produce warnings:
drivers/infiniband/hw/ipath/ipath_sdma.c: In function 'sdma_abort_task':
drivers/infiniband/hw/ipath/ipath_sdma.c:267: warning: passing argument 2 of 'constant_test_bit' from incompatible pointer type
and so on, because it uses test_bit() to operation on a u64 value
(returned by ipath_read_kref64() for a hardware register).
Fix up these warnings by converting the test_bit() operations to &ing
with appropriate symbolic defines of the bits within the hardware
register. This has the benign side-effect of making the code more
self-documenting as well.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Change references from for_each_cpu_mask to for_each_cpu_mask_nr
where appropriate
Reviewed-by: Paul Jackson <pj@sgi.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When creating a kernel QP where the consumer asked for a send queue
with lots of scatter/gater entries, set_kernel_sq_size() incorrectly
returned an error if the send queue stride is larger than the
hardware's maximum send work request descriptor size. This is not a
problem; the only issue is to make sure that the actual descriptors
used do not overflow the maximum descriptor size, so check this instead.
Clamp the returned max_send_sge value to be no bigger than what
query_device returns for the max_sge to avoid confusing hapless users,
even if the hardware is capable of handling a few more s/g entries.
This bug caused NFS/RDMA mounts to fail when the server adapter used
the mlx4 driver.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Move rcu-protected lists from list.h into a new header file rculist.h.
This is done because list are a very used primitive structure all over the
kernel and it's currently impossible to include other header files in this
list.h without creating some circular dependencies.
For example, list.h implements rcu-protected list and uses rcu_dereference()
without including rcupdate.h. It actually compiles because users of
rcu_dereference() are macros. Others RCU functions could be used too but
aren't probably because of this.
Therefore this patch creates rculist.h which includes rcupdates without to
many changes/troubles.
Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Josh Triplett <josh@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The mthca driver returns the maximum number of scatter/gather entries
returned by the firmware as the max_sge value when device properties
are queried. However, the firmware also reports a limit on the
maximum descriptor size allowed, and because mthca takes into account
the worst case send request overhead when checking whether to allow a
QP to be created, the largest number of scatter/gather entries that
can be used with mthca may be limited by the maximum descriptor size
rather than just by the actual s/g entry limit.
This means that applications cannot actually create QPs with
max_send_sge equal to the limit returned by ib_query_device(). Fix
this by checking if the maximum descriptor size imposes a lower limit
and if so returning that lower limit.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
drivers/infiniband/hw/cxgb3/iwch_qp.c: In function 'iwch_post_send':
drivers/infiniband/hw/cxgb3/iwch_qp.c:232: warning: 't3_wr_flit_cnt' may be used uninitialized in this function
This is what akpm describes as "the dopey
gcc-doesn't-know-that-foo(&var)-writes-to-var problem."
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
drivers/infiniband/hw/mlx4/qp.c: In function 'mlx4_ib_post_send':
drivers/infiniband/hw/mlx4/qp.c:1460: warning: 'seglen' may be used uninitialized in this function
This is the dopey gcc-doesn't-know-that-foo(&var)-writes-to-var problem.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When I fixed the RC receive completion opcode in 2bfc8e9e ("IB/ipath:
Return the correct opcode for RDMA WRITE with immediate"), I forgot to
fix UC, which had the same problem for RDMA write with immediate
returning the wrong opcode.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commit f018c7e1 ("IB/ipath: Change ipath_devdata.ipath_sdma_status to be
unsigned long") changed ipath_sdma_status to be unsigned long, but left
a few debug messages that printed it out with a %016llx format, which
generates the warnings
drivers/infiniband/hw/ipath/ipath_sdma.c:348: warning: format '%016llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int'
drivers/infiniband/hw/ipath/ipath_sdma.c:618: warning: format '%016llx' expects type 'long long unsigned int', but argument 3 has type 'long unsigned int'
Fix this by changing the format used to print out the value to %08lx
(8 hex digits are now sufficient, because the highest bit used is 31).
Warnings reported by Randy Dunlap <randy.dunlap@oracle.com>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
cxio_flush_sq() was failing to wrap around the software send queue
causing garbage completion entries on a flush operation.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Andrew Morton <akpm@linux-foundation.org> pointed out that bitops
should take an unsigned long * arg. However, the ipath driver was
doing bitops on struct ipath_devdata.ipath_sdma_status, which is u64.
Change this member to unsigned long to avoid tons of warnings when x86
fixes the bitops to take unsigned long * instead of void *.
Also, change the IPATH_SDMA_RUNNING and IPATH_SDMA_SHUTDOWN bit
numbers to 30 and 31 (instead of 62 and 63) so that we're not setting
another booby trap for someone who tries to make ipath work on a
32-bit architecture.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The official reason is "with the presence of pid namespaces in the
kernel using pid_t-s inside one is no longer safe."
But the reason I fix this right now is the following:
About a month ago (when 2.6.25 was not yet released) there still was a
one last caller of a to-be-deprecated-soon function find_pid() - the
kill_proc() function, which in turn was only used by nfs callback
code.
During the last merge window, this last caller was finally eliminated
by some NFS patch(es) and I was about to finally kill this kill_proc()
and find_pid(), but found, that I was late and the kill_proc is now
called from the ipath driver since commit 58411d1c ("IB/ipath: Head of
Line blocking vs forward progress of user apps").
So here's a patch that fixes this code to use struct pid * and (!)
the kill_pid routine.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If an out of sequence RDMA read response middle or last packet is
received, we should only resend the RDMA read request on the first
out of sequence packet and drop subsequent out of sequence packets
otherwise, we get "too many retries".
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The send DMA hardware queue voided a number of prior assumptions about
when a send is complete which led to completions being generated out of
order. There were also a number of locking issues when switching the QP
to the error or reset states, and we implement the IB_QPS_SQD state.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When errors are detected in RC, the QP should transition to the
IB_QPS_ERR state, not the IB_QPS_SQE state. Also, when the error is on
the responder side, the receive work completion error was incorrect
(remote vs. local).
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix some bugs with the max_aggr module parameter added with LRO support:
- The module parameter value ignored and not actually used to set
lro_mgr.max_aggr.
- MODULE_PARM_DESC had a typo "_mro_" instead of "_lro_" so it didn't
end up describing the actual module parameter.
- The nes_lro_max_aggr variable was declared as unsigned, but the
module_param line said "int" instead of "uint" for the type.
- The default value for the parameter was stuck in the permissions
field of module_param, which led to nonsensical permissions for the
file under /sys/module/iw_nes/param.
- The parameter was used in only one file but defined in another, which
led to the variable being global for no good reason. Move everything
related to the parameter to the file nes_hw.c where it is actually
used.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This is necessary because, in a multicore environment, a race between
uverbs async handler and destroy QP could occur.
Signed-off-by: Stefan Roscher <stefan.roscher at de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
What's fixed:
in ipath_cancel_sends()
We need to unconditionally set ABORTING. So, swap the tests
so the set_bit() isn't shadowed by the &&.
If we've disarmed the piobufs, then we need to unconditionally
set DISARMED. So, move it out from the overly protective if
at the bottom.
in sdma_abort_task()
Abort_task was written knowing that the SDMA engine would always
be reset (and restarted) on error. A recent change broke that
fundamental assumption by taking the restart portion and making
it conditional on a link status change. But, SDMA can go boom
without a link status change in some conditions.
Signed-off-by: John Gregor <john.gregor@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Now that we always use PIO for vl15 on 7220, we could get stuck forever
if we happened to run out of PIO buffers from the verbs code, because
the setup code wouldn't run; the interrupt was also ignored if SDMA was
supported. We also have to reduce the pio update threshold if we have
fewer kernel buffers than the existing threshold.
Clean up the initialization a bit to get ordering safer and more
sensible, and use the existing ipath_chg_kernavail call to do init,
rather than doing it separately.
Drop unnecessary clearing of pio buffer on pio parity error.
Drop incorrect updating of pioavailshadow when exitting freeze mode
(software state may not match chip state if buffer has been allocated
and not yet written).
If we couldn't get a kernel buffer for a while, make sure we are
in sync with hardware, mainly to handle the exitting freeze case.
Signed-off-by: Dave Olson <dave.olson@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The loop in ipath_kreceive() that processes packets increments the
loop-index 'i' once too often, because the exit condition does not
depend on it, and is checked after the increment. By adding a check for
!last to the iterator in the for loop, we correct that in a way that is
not so likely to be re-broken by changes in the loop body.
Signed-off-by: Michael Albaugh <micheal.albaugh@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch fixes a bug in the RC responder which generates a completion
entry with the wrong opcode when an RDMA WRITE with immediate is received.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The semantics of cancel_sends changed, but the code using it was missed.
Don't leave sends and pioavail updates disabled, and add a comment as to
why the force update is needed.
Signed-off-by: Dave Olson <dave.olson@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If a send work request has immediate errors and is not put on the
send queue, we shouldn't update any of the QP state.
The increment of the SSN wasn't obeying this.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We warn about prototype chips, but the function that checks for
support is also called as a result of a get_portinfo request, which
can clutter the logs.
Restrict warning to only appear during initialization.
Signed-off-by: Michael Albaugh <michael.albaugh@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Currently, iw_cxgb3 is severely limited on the amount of userspace
memory that can be registered in in a single memory region, which
causes big problems for applications that expect to be able to
register 100s of MB.
The problem is that the driver uses a single kmalloc()ed buffer to
hold the physical buffer list (PBL) for the entire memory region
during registration, which means that 8 bytes of contiguous memory are
required for each page of memory being registered. For example, a 64
MB registration will require 128 KB of contiguous memory with 4 KB
pages, and it unlikely that such an allocation will succeed on a busy
system.
This is purely a driver problem: the temporary page list buffer is not
needed by the hardware, so we can fix this by writing the PBL to the
hardware in page-sized chunks rather than all at once. We do this by
splitting the memory registration operation up into several steps:
- Allocate PBL space in adapter memory for the full registration
- Copy PBL to adapter memory in chunks
- Allocate STag and enable memory region
This also allows several other cleanups to the __cxio_tpt_op()
interface and related parts of the driver.
This change leaves the reregister memory region and memory window
operations broken, but they already didn't work due to other
longstanding bugs, so fixing them will be left to a later patch.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Current iw_cxgb3 code adds PBL memory to the driver's gen_pool in 2 MB
chunks. This limits the largest single allocation that can be done to
the same size, which means that with 4 KB pages, each of which takes 8
bytes of PBL memory, the largest memory region that can be allocated
is 1 GB (256K PBL entries * 4 KB/entry).
Remove this limit by adding all the PBL memory in a single gen_pool
chunk, if possible. Add code that falls back to smaller chunks if
gen_pool_add() fails, which can happen if there is not sufficient
contiguous lowmem for the internal gen_pool bitmap.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Also remove duplicate assignment of local_ca_ack_delay and change
min_t check for local_ca_ack_delay to u8 instead of int.
Signed-off-by: Stefan Roscher <stefan.roscher at de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Testing on large clusters shows its way too short at 10 secs.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Remove bad BUG_ON() that can trigger in correct operation from
close_con_rpl(). It is possible to get a close_rpl message on a dead
connection. The sequence is:
- host refs ep for close exchange
- host posts close_req
- hw posts PEER_ABORT from incoming RST
- host marks ep DEAD
- host posts ABORT_RPL and releases ep resources
- hw posts CLOSE_RPL
- host derefs ep and ep freed.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
- Flush the QP only after the HW disables the connection. Currently
we flush the QP when transitioning to CLOSING. This exposes a race
condition where the HW can complete a RECV WR, for instance, -and-
the SW can flush that same WR.
- Only call CQ event handlers on flush IFF we actually flushed something.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When I merged bbf8eed1 ("IB/mlx4: Add support for resizing CQs") I
changed things around so that mlx4_ib_alloc_cq_buf() and
mlx4_ib_free_cq_buf() were used everywhere they could be. However, I
screwed up the number of entries passed into mlx4_ib_alloc_cq_buf()
in a couple places -- the function bumps the number of entries
internally, so the caller shouldn't add 1 as well.
Passing a too-big value for the number of entries to mlx4_ib_free_cq_buf()
can cause the cleanup to go off the end of an array and corrupt
allocator state in interesting ways.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Various cleanups:
- Change // to /* .. */
- Place whitespace around binary operators.
- Trim down a few long lines.
- Some minor alignment formatting for better readability.
- Remove some silly tabs.
Signed-off-by: Glenn Streiff <gstreiff@neteffect.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch enables the iw_nes module for NetEffect RNICs to support
additional PHYs including SFP+ (referred to as ARGUS in the code).
Signed-off-by: Eric Schneider <eric.schneider@neteffect.com>
Signed-off-by: Glenn Streiff <gstreiff@neteffect.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commit cb9fbc5c ("IB: expand ib_umem_get() prototype") changed the
mthca userspace ABI to provide a way for userspace to indicate which
memory regions need the DMA write barrier attribute. However, it is
possible to handle this without breaking existing userspace, by having
the mthca kernel driver recognize whether it is talking to old or new
userspace, depending on the size of the register MR structure passed in.
The only potential drawback of this is that is allows old userspace
(which has a bug with DMA ordering on large SGI Altix systems) to
continue to run on new kernels, but the advantage of allowing old
userspace to continue to work on unaffected systems seems to outweigh
this, and we can print a warning to push people to upgrade their
userspace.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When a FMR is unmapped, mthca resets the map count to 0, and clears
the upper part of the R_Key which is used as the sequence counter.
This poses a problem for RDS, which uses ib_fmr_unmap as a fence
operation. RDS assumes that after issuing an unmap, the old R_Keys
will be invalid for a "reasonable" period of time. For instance,
Oracle processes uses shared memory buffers allocated from a pool of
buffers. When a process dies, we want to reclaim these buffers -- but
we must make sure there are no pending RDMA operations to/from those
buffers. The only way to achieve that is by using unmap and sync the
TPT.
However, when the sequence count is reset on unmap, there is a high
likelihood that a new mapping will be given the same R_Key that was
issued a few milliseconds ago.
To prevent this, don't reset the sequence count when unmapping a FMR.
Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If a lot of QPs fall into Error state at once and the EQ of the
respective HCA is too small, it might overrun, causing the eHCA driver
to stop processing completion events and calling the application's
completion handlers, effectively causing traffic to stop.
Fix this by limiting available QPs and CQs to a customizable max
count, and determining EQ size based on these counts and a worst-case
assumption.
Signed-off-by: Stefan Roscher <stefan.roscher@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ehca_create_eq() was assigning a signed return value to an unsiged
local variable and then checking if the variable was < 0, which meant
that errors were always ignored. Fix this by using one variable for
signed integer return values and another for u64 hcall return values.
Bug originally found by Roel Kluin <12o3l@tiscali.nl>.
Signed-off-by: Hoang-Nam Nguyen <hnguyen@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Open MPI, Intel MPI and other applications don't respect the iWARP
requirement that the client (active) side of the connection send the
first RDMA message. This class of application connection setup is
called peer-to-peer. Typically once the connection is setup, _both_
sides want to send data.
This patch enables supporting peer-to-peer over the chelsio RNIC by
enforcing this iWARP requirement in the driver itself as part of RDMA
connection setup.
Connection setup is extended, when the peer2peer module option is 1,
such that the MPA initiator will send a 0B Read (the RTR) just after
connection setup. The MPA responder will suspend SQ processing until
the RTR message is received and reply-to.
In the longer term, this will be handled in a standardized way by
enhancing the MPA negotiation so peers can indicate whether they
want/need the RTR and what type of RTR (0B read, 0B write, or 0B send)
should be sent. This will be done by standardizing a few bits of the
private data in order to negotiate all this. However this patch
enables peer-to-peer applications now and allows most of the required
firmware and driver changes to be done and tested now.
Design:
- Add a module option, peer2peer, to enable this mode.
- New firmware support for peer-to-peer mode:
- a new bit in the rdma_init WR to tell it to do peer-2-peer
and what form of RTR message to send or expect.
- process _all_ preposted recvs before moving the connection
into rdma mode.
- passive side: defer completing the rdma_init WR until all
pre-posted recvs are processed. Suspend SQ processing until
the RTR is received.
- active side: expect and process the 0B read WR on offload TX
queue. Defer completing the rdma_init WR until all
pre-posted recvs are processed. Suspend SQ processing until
the 0B read WR is processed from the offload TX queue.
- If peer2peer is set, driver posts 0B read request on offload TX
queue just after posting the rdma_init WR to the offload TX queue.
- Add CQ poll logic to ignore unsolicitied read responses.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
cxgb3 only supports 4GB memory regions. The lustre RDMA code uses
this attribute and currently has to code around our bad setting.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Open MPI and other stress testing exposed a few bad bugs in handling
aborts in the middle of a normal close. Fix these by:
- serializing abort reply and peer abort processing with disconnect
processing
- warning (and ignoring) if ep timer is stopped when it wasn't running
- cleaning up disconnect path to correctly deal with aborting and
dead endpoints
- in iwch_modify_qp(), taking a ref on the ep before releasing the qp
lock if iwch_ep_disconnect() will be called. The ref is dropped
after calling disconnect.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Extend the mlx4_cq_resize() API with a way to set the "collapsed" flag
for the CQ being created.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a new parameter, dmasync, to the ib_umem_get() prototype. Use dmasync = 1
when mapping user-allocated CQs with ib_umem_get().
Signed-off-by: Arthur Kepner <akepner@sgi.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: David Miller <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the volatile qualifier from the cq_vbase member of struct
nes_hw_cq, and add an rmb() in the one place where it looks like
access order might make a difference. As usual, removing a volatile
qualifier in a declaration is actually a bug fix, since a volatile
qualifier is not sufficient to make sure that aggressively
out-of-order CPUs don't reorder things and cause incorrect results.
For example, a CPU might speculatively execute reads of other cqe
fields before the NIC hardware has written those fields and before it
has set the NES_CQE_VALID bit (even though those reads come after the
test of the NES_CQE_VALID bit in program order), but then when the CPU
actually executes the conditional test of the NES_CQE_VALID, the bit
has been set, and the CPU will proceed with the results of the earlier
speculative execution and end up using bogus data.
This also gets rid of the warning:
drivers/infiniband/hw/nes/nes_verbs.c: In function 'nes_destroy_cq':
drivers/infiniband/hw/nes/nes_verbs.c:1978: warning: passing argument 3 of 'pci_free_consistent' discards qualifiers from pointer target type
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In addition to mlx4_ib, there will be ethernet and FC consumers of
mlx4_core, so move the code for managing kernel doorbells into the
core module to avoid having to duplicate this multiple times.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Always enable large page support; didn't seem to cause problems for anyone.
Signed-off-by: Joachim Fenkes <fenkes@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
After PXE boot, the iw_nes driver does a full reset to ensure the card
is in a clean state. However, it doesn't wait for firmware to
complete its work before issuing a port reset to enable the ports,
which leads to problems bringing up the ports.
The solution is to wait for firmware to complete its work before
proceeding with port reset.
This bug was flagged by Roland Dreier <rolandd@cisco.com>.
Cc: <stable@kernel.org>
Signed-off-by: Chien Tung <ctung@neteffect.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Use NIPQUAD_FMT instead of printing raw 32-bit hex quantities in
debugging output.
Acked-by: Glenn Streiff <gstreiff@neteffect.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The PCI MSI interface is stubbed out properly so that all the
functions just return failure if PCI_MSI=n, so there's no reason to
have "#ifdef CONFIG_PCI_MSI" blocks in ipath_iba7220.c.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Before IBA7220 support was added, the ipath driver didn't support any
hardware unless PCI_MSI and/or HT_IRQ was enabled. However, the
IBA7220 can generate INTx interrupts, so it makes sense to allow the
driver to be build even if PCI_MSI=n and HT_IRQ=n.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The new IBA7220 code added a call to ipath_init_iba7220_funcs() that
is compiled unconditionally, but only built the IBA7220 code if
PCI_MSI is enabled. Fix this by building the IBA7220 file
unconditonally.
This fixes build breakage when PCI_MSI=n, HT_IRQ=y and
INFINIBAND_IPATH=y reported by Ingo Molnar <mingo@elte.hu>:
drivers/built-in.o: In function `ipath_init_one':
ipath_driver.c:(.devinit.text+0x1e5bc): undefined reference to `ipath_init_iba7220_funcs'
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commit 124b4dcb ("IB/ipath: add calls to new 7220 code and enable in
build") inadvertently added core to set dev->class_dev.dev back into
ib_ipath. This is completely redundant since commit 1912ffbb ("IB: Set
class_dev->dev in core for nice device symlink"), which removed
class_dev setting from low-level drivers, and also will break the build
when class_dev is removed completely from struct ib_device.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Describe disable_sma parameter with its name rather than the internal
ib_ipath_disable_sma variable name, so that the description shows up
properly in modinfo.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Acked-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Remove redundant static declarations of functions that are defined
before they are used in the source.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-2.6: (36 commits)
SCSI: convert struct class_device to struct device
DRM: remove unused dev_class
IB: rename "dev" to "srp_dev" in srp_host structure
IB: convert struct class_device to struct device
memstick: convert struct class_device to struct device
driver core: replace remaining __FUNCTION__ occurrences
sysfs: refill attribute buffer when reading from offset 0
PM: Remove destroy_suspended_device()
Firmware: add iSCSI iBFT Support
PM: Remove legacy PM (fix)
Kobject: Replace list_for_each() with list_for_each_entry().
SYSFS: Explicitly include required header file slab.h.
Driver core: make device_is_registered() work for class devices
PM: Convert wakeup flag accessors to inline functions
PM: Make wakeup flags available whenever CONFIG_PM is set
PM: Fix misuse of wakeup flag accessors in serial core
Driver core: Call device_pm_add() after bus_add_device() in device_add()
PM: Handle device registrations during suspend/resume
block: send disk "change" event for rescan_partitions()
sysdev: detect multiple driver registrations
...
Fixed trivial conflict in include/linux/memory.h due to semaphore header
file change (made irrelevant by the change to mutex).
This converts the main ib_device to use struct device instead of struct
class_device as class_device is going away.
Signed-off-by: Tony Jones <tonyj@suse.de>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
None of these files use any of the functionality promised by
asm/semaphore.h. It's possible that they rely on it dragging in some
unrelated header file, but I can't build all these files, so we'll have
fix any build failures as they come up.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
The mlx4_ib driver is stable enough for production use, so bump the
version number to 1.0 to indicate this.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Also, introduce a few inline helper functions to make the code more readable.
Signed-off-by: Stefan Roscher <stefan.roscher@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Move the free_irq() call in nes_remove() to before the tasklet_kill();
otherwise there is a window after tasklet_kill() where a new interrupt
can be handled and reschedule the tasklet, leading to a use-after-free
crash.
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The ib_mthca driver has been stable for a while, so bump the version
number to 1.0 to indicate this.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If the QP was moved to another state (such as SQE) by the hardware,
then after this change the user won't have to set the IBV_QP_CUR_STATE
mask in order to execute modify QP in order to recover from this state.
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If the QP was moved to another state (such as SQE) by the hardware,
then after this change the user won't have to set the IBV_QP_CUR_STATE
mask in order to execute modify QP in order to recover from this state.
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix a place where we might dereference a NULL pointer; this fixes
Coverity CID 1392. On inspection I also found a place where we could
attempt to kmem_cache_free() a NULL pointer, so fix this too.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Handle IB_WR_SEND_WITH_INV work requests.
This resurrects a patch sent long ago by Mikkel Hagen <mhagen@iol.unh.edu>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a new IB_WR_SEND_WITH_INV send opcode that can be used to mark a
"send with invalidate" work request as defined in the iWARP verbs and
the InfiniBand base memory management extensions. Also put "imm_data"
and a new "invalidate_rkey" member in a new "ex" union in struct
ib_send_wr. The invalidate_rkey member can be used to pass in an
R_Key/STag to be invalidated. Add this new union to struct
ib_uverbs_send_wr. Add code to copy the invalidate_rkey field in
ib_uverbs_post_send().
Fix up low-level drivers to deal with the change to struct ib_send_wr,
and just remove the imm_data initialization from net/sunrpc/xprtrdma/,
since that code never does any send with immediate operations.
Also, move the existing IB_DEVICE_SEND_W_INV flag to a new bit, since
the iWARP drivers currently in the tree set the bit. The amso1100
driver at least will silently fail to honor the IB_SEND_INVALIDATE bit
if passed in as part of userspace send requests (since it does not
implement kernel bypass work request queueing). Remove the flag from
all existing drivers that set it until we know which ones are OK.
The values chosen for the new flag is not consecutive to avoid clashing
with flags defined in the XRC patches, which are not merged yet but
which are already in use and are likely to be merged soon.
This resurrects a patch sent long ago by Mikkel Hagen <mhagen@iol.unh.edu>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch adds the initialization calls into the new 7220 HCA files,
changes the Makefile to compile and link the new files, and code to
handle send DMA.
Signed-off-by: Dave Olson <dave.olson@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The patch adds a number of minor changes to support newer HCAs:
- New send buffer control bits
- New error condition bits
- Locking and initialization changes
- More send buffers
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
A new file which allows the IBA7220 send DMA engine to be used from
userland. The routines here are not linked in yet, that will happen in
a follow-on patch...
Signed-off-by: Arthur Jones <arthur.jones@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
A new header file which allows the IBA7220 send DMA engine to be used
from userland. The definitions here are not used yet, that will happen
in a follow-on patch...
Signed-off-by: Arthur Jones <arthur.jones@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The IBA7220 HCA has a new feature to DMA data to the on chip send
buffers instead of or in addition to the host CPU doing the data
transfer. This patch adds code to support the send DMA queue.
Signed-off-by: John Gregor <john.gregor@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch adds binary data to initialize the IB SERDES.
Signed-off-by: Michael Albaugh <Michael.Albaugh@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The control and initialization of the SerDes blocks of the IBA7220 is
sufficiently complex to merit a separate file.
Signed-off-by: Michael Albaugh <Michael.Albaugh@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch adds the HCA-specific code for the IBA7220 HCA.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch adds a new ASIC-specific header file for the HCAs using the IBA7220.
Signed-off-by: Michael Albaugh <Michael.Albaugh@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This is part of a patch series to add support for a new HCA. This patch
adds new fields to the header files.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch makes chip reset more robust and reduces lock contention
between user and kernel TID register updates.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Newer HCAs support MSI interrupts and also INTx interrupts. Fix the
code so that INTx can be reliably enabled if MSI interrupts are not
working.
Signed-off-by: Dave Olson <dave.olson@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Newer HCAs have a threshold counter to reduce the number of DMAs the
chip makes to update the PIO buffer availability status bits. This
patch enables the feature.
Signed-off-by: Dave Olson <dave.olson@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Whenever the LID is set, notify the HCA specific code so that the
appropriate HW registers can be updated. Also log the info on the
console at low priority.
Signed-off-by: Dave Olson <dave.olson@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch adds code to enable/disable the IBTA 1.2 heartbeat for testing
if the HCA supports it.
Signed-off-by: Dave Olson <dave.olson@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The hardware-based recovery doesn't need any intervention, and in a few
cases we can get a bit confused about state and skip steps such as
turning off the link state LED when we consider recovery to be "down".
So ignore this transition, and either we recover in hardware, or we
transition to down, and will handle it then.
Signed-off-by: Dave Olson <dave.olson@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Newer HCAs have a HW option to write a sequence number to each receive
queue entry and avoid a separate DMA of the tail register to memory.
This patch adds support for these changes.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch makes some white space changes and minor non-functional
changes to more closely match the code in OFED-1.3.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch checks for old and new format writes to send a packet via the
diagnostic interface.
Signed-off-by: Michael Albaugh <Michael.Albaugh@Qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Raw comparison against jiffies will fail if jiffies wraps, although
since ipath currently only supports 64-bit architectures, this is rather
far-fetched. Still, it's better to use time_after_eq().
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Rather than have build_mlx_header() return a negative value on failure
and the length of the segments it builds on success, add a pointer
parameter to return the length and return 0 on success. This matches
the calling convention used for build_lso_seg() and generates slightly
smaller code -- eg, on 64-bit x86:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-22 (-22)
function old new delta
mlx4_ib_post_send 2023 2001 -22
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a create_flags member to struct ib_qp_init_attr that will allow a
kernel verbs consumer to create a pass special flags when creating a QP.
Add a flag value for telling low-level drivers that a QP will be used
for IPoIB UD LSO. The create_flags member will also be useful for XRC
and ehca low-latency QP support.
Since no create_flags handling is implemented yet, add code to all
low-level drivers to return -EINVAL if create_flags is non-zero.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for reading newer card's EEPROMs while continuing to support
older EEPROMs.
Also, add support for the temperature sensor if present.
Signed-off-by: Michael Albaugh <Michael.Albaugh@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This reduces the latency for RC ACKs when a PIO buffer is available.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
A fixed partitioning of send buffers is determined at driver load time
for user processes and kernel use. Since send buffers are a scarce
resource, it makes sense to allow the kernel to use the buffers if they
are not in use by a user process.
Also, eliminate code duplication for ipath_force_pio_avail_update().
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The link can be put in LINKDOWN_DISABLE state either locally or via a
MAD. However, the link-recovery code will take it out of that state as
a side-effect of attempts to clear SerDes/XGXS issues.
We add a flag to indicate "link is down on purpose, leave it alone."
Signed-off-by: Michael Albaugh <michael.albaugh@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Update the module author to the current email address.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In list iteration code, you normally wouldn't be calling
"container_of()" directly anyway, you'd be invoking "list_entry()".
But you don't even need that here, "list_for_each_entry()" is fine.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Acked-by: Glenn Streiff <gstreiff@neteffect.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The session_id members of struct nes_cm_listener and struct
nes_cm_node are write-only, so remove them. This allows the
session_id member of struct nes_cm_core to be removed as well, since
it is only used to write those other session_id values.
This removes the use of current->tgid (which will be deprecated)
pointed out by Pavel Emelyanov <xemul@openvz.org>.
Acked-by: Glenn Streiff <gstreiff@neteffect.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In slave_or_pri_blk(), pci_write_config_byte() is used to write a
16-bit quantity to clear linkctrl CRC error bits. This is clearly a
bug and also causes the warning
drivers/infiniband/hw/ipath/ipath_iba6110.c: In function 'slave_or_pri_blk':
drivers/infiniband/hw/ipath/ipath_iba6110.c:849: warning: overflow in implicit constant conversion
Fix this by using pci_write_config_word() instead.
Acked-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The receive queue number of WRs and SGEs shouldn't be checked if a
SRQ is specified.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Remove useless comment about list removal since locks are held and
the code checks that the QP is on the list before removing it.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Workaround a QLE7140 problem that in rare cases causes flow control
problems after link recovery by forcing a link retrain after recovery.
A module parameter is provided to control the behavior in case it causes
problems.
Signed-off-by: Dave Olson <dave.olson@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch adds code to get/set portinfo to support multiple link speeds
and widths.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
There's a conflict between our need to quiesce PSM-based applications
to avoid HoL blocking when the IB link goes down and the apps' desire
to remain running so that their quiescence timout mechanism can keep
running.
The compromise is to STOP the processes for a fixed period of time and
then alternate between CONT and STOP until the link is again active.
If there are poor interactions with subnet manager configuration at a
given site, the interval can be adjusted via a module paramter.
Signed-off-by: John Gregor <john.gregor@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Don't try to handle freeze mode HW errors if the driver is in diagnostic
mode since some tests can cause errors that shouldn't be processed.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The check for link up was incorrect, thus setting the LED display
inconsistently with the link state.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The error recovery code for updating the driver's cached status information
for which send buffers are busy or free wasn't updated for IBA7220.
It should be similar to the initialization code in enable_chip().
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix byte order of value assigned to pioavailshadow. This bug was
detected by sparse endianness warnings.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In mthca_alloc_icm_table(), the number of entries to allocate for the
table->icm array is computed by calculating obj_size * nobj and then
dividing by MTHCA_TABLE_CHUNK_SIZE. If nobj is really large, then
obj_size * nobj may overflow and the division may get the wrong value
(even a negative value). Fix this by calculating the number of
objects per chunk and then dividing nobj by this value instead.
This patch allows crazy configurations such as loading ib_mthca with
the module parameter num_mtt=33554432 to work properly.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
mthca_make_profile() returns the size in bytes of the HCA context
layout it creates, or a negative value if an error occurs. However,
the return value is declared as u64 and the memfree initialization
path casts this value to int to test if it is negative. This makes it
think incorrectly than an error has occurred if the context size
happens to be bigger than 2GB, since this turns into a negative int.
Fix this by having mthca_make_profile() return an s64 and testing
for an error by checking whether this 64-bit value itself is negative.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Pavel Emelyanov <xemul@openvz.org> mentioned in <http://lkml.org/lkml/2008/3/17/131>
that the task_struct->tgid field is about to become deprecated, so the
uses in the ehca driver need to be fixed up.
However, all the uses in ehca are for some object ownership checking
that is not really needed, and anyway is implementing a policy that
should be in common code rather than a low-level driver. So just
remove all the checks.
Signed-off-by: Hoang-Nam Nguyen <hnguyen@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Enable use of 4KB MTU. Since the driver uses more pinned memory for
receive buffers when the 4KB MTU is enabled, whether or not the fabric
supports that MTU, add a "mtu4096" module parameter that can be used to
limit the MTU to 2KB when it is known that 4KB MTUs can't be used
anyway.
Signed-off-by: Dave Olson <dave.olson@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The code was checking if units are present, but not that present units
were usable (link up, etc.)
Signed-off-by: Dave Olson <dave.olson@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Modern I/O buses like PCIe and HT can be configured for multiple speeds
and widths. When an ipath HCA seems to have lower than expected
performance, it is very useful to be able to display what the driver
thinks the bus speed is.
Signed-off-by: Dave Olson <dave.olson@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch makes some constants chip-specific, and makes some related
changes to prepare for supporting another HCA.
Signed-off-by: Dave Olson <dave.olson@qlogic.com
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Recent sparse versions and kernel cleanups knock down the false positive
rate of the ipath driver code to a point where having it be sparse clean
is worthwhile. Here we fixup the sparse warnings. Some of these warnings
(and the impetus to run sparse again) are due to work by Roland Dreier.
Signed-off-by: Arthur Jones <arthur.jones@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Arbel and Sinai devices support checksum generation and verification
of TCP and UDP packets for UD IPoIB messages. This patch checks if
the HCA supports this and sets the IB_DEVICE_UD_IP_CSUM capability
flag if it does. It implements support for handling the IB_SEND_IP_CSUM
send flag and setting the csum_ok field in receive work completions.
Signed-off-by: Eli Cohen <eli@mellnaox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ConnectX devices support checksum generation and verification of TCP
and UDP packets for UD IPoIB messages. This patch checks if the HCA
supports this and sets the IB_DEVICE_UD_IP_CSUM capability flag if it
does. It implements support for handling the IB_SEND_IP_CSUM send
flag and setting the csum_ok field in receive work completions.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Ali Ayub <ali@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
__FUNCTION__ is gcc-specific, use __func__ instead.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Allow the compiler to optimize better and generate smaller code:
add/remove: 0/6 grow/shrink: 2/0 up/down: 1528/-1864 (-336)
function old new delta
.ehca_set_pagebuf 1344 2172 +828
.ehca_probe 2312 3012 +700
ehca_set_pagebuf_phys 24 - -24
ehca_set_pagebuf_fmr 24 - -24
ehca_init_device 24 - -24
.ehca_set_pagebuf_fmr 480 - -480
.ehca_set_pagebuf_phys 512 - -512
.ehca_init_device 800 - -800
Also this fixes warnings like:
drivers/infiniband/hw/ehca/ehca_mrmw.c:2015:5: warning: symbol 'ehca_set_pagebuf_fmr' was not declared. Should it be static?
Signed-off-by: Roland Dreier <rolandd@cisco.com>
On some platforms, eg sparc64, dma_addr_t is not the same size as a
pointer, so printing dma_addr_t values by casting to void * and using
a %p format generates warnings. Fix this by casting to unsigned long
and using %lx instead. This fixes the warnings:
drivers/infiniband/hw/nes/nes_verbs.c: In function 'nes_setup_virt_qp':
drivers/infiniband/hw/nes/nes_verbs.c:1047: warning: cast to pointer from integer of different size
drivers/infiniband/hw/nes/nes_verbs.c:1078: warning: cast to pointer from integer of different size
drivers/infiniband/hw/nes/nes_verbs.c:1078: warning: cast to pointer from integer of different size
drivers/infiniband/hw/nes/nes_verbs.c: In function 'nes_reg_user_mr':
drivers/infiniband/hw/nes/nes_verbs.c:2657: warning: cast to pointer from integer of different size
Reported by Andrew Morton <akpm@linux-foundation.org>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
nes_unregister_ofa_device() dereferences the nesibdev pointer before
testing if it's NULL. Also, the test is doubly redundant because the
only caller of nes_unregister_ofa_device() is nes_destroy_ofa_device(),
which already tests if nesibdev is NULL. Remove the unnecessary test.
This was spotted by the Coverity checker (CID 2190).
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The struct mlx4_interface.event() method was supposed to get an enum
mlx4_dev_event, but the driver code was actually passing in the
hardware enum mlx4_event values. Fix up the callers of
mlx4_dispatch_event() so that they pass in the right type of value,
and fix up the event method in mlx4_ib so that it can handle the enum
mlx4_dev_event values.
This eliminates the need for the subtype parameter to the event
method, so remove it.
This also fixes the sparse warning
drivers/net/mlx4/intf.c:127:48: warning: mixing different enum types
drivers/net/mlx4/intf.c:127:48: int enum mlx4_event versus
drivers/net/mlx4/intf.c:127:48: int enum mlx4_dev_event
Signed-off-by: Roland Dreier <rolandd@cisco.com>
None of the cqp_reqs_XXX counters were ever used anywhere, and neither
was the nics_per_function variable.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix
drivers/infiniband/hw/ipath/ipath_init_chip.c:526:10: warning: symbol 'val' shadows an earlier one
drivers/infiniband/hw/ipath/ipath_init_chip.c:473:6: originally declared here
by giving the second val a different name.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Acked-by: Arthur Jones <arthur.jones@qlogic.com>
There's no reason for the third parameter of ipath_count_units() to be
a u32 *, so change it to be an int * instead. This fixes the sparse
warning:
drivers/infiniband/hw/ipath/ipath_file_ops.c:1654:47: warning: incorrect type in argument 3 (different signedness)
drivers/infiniband/hw/ipath/ipath_file_ops.c:1654:47: expected unsigned int [usertype] *maxportsp
drivers/infiniband/hw/ipath/ipath_file_ops.c:1654:47: got int *<noident>
Signed-off-by: Arthur Jones <arthur.jones@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix sparse warnings about pointer signedness by using a signed int when
calling idr_get_new_above().
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Write tests for NULL pointers as
if (!ptr)
instead of
if (ptr == 0UL)
to fix sparse warnings.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Because of a typo in iwch_accept_cr(), the cxgb3 connection handling
code programs the hardware IRD (incoming RDMA read queue depth) with
the value that is passed in for the ORD (outgoing RDMA read queue
depth). In particular this means that if an application passes in IRD
> 0 and ORD = 0 (which is a completely sane and valid thing to do for
an app that expects only incoming RDMA read requests), then the
hardware will end up programmed with IRD = 0 and the app will fail in
a mysterious way.
Fix this by using "ep->ird" instead of "ep->ord" in the intended place.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the calculation of the MSS for RDMA connections: we need to
allow space in frames for a VLAN tag too.
Signed-off-by: Chien Tung <ctung@neteffect.com>
Signed-off-by: Glenn Streiff <gstreiff@neteffect.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Reset the retry counter when we get a good RDMA_READ_RESPONSE_MIDDLE
packet. This fix will prevent the requester from reporting a retry
exceeded error too early.
Signed-off-by: Patrick Marchand Latifi <patrick.latifi@qlogic.com>
A work completion entry could be placed on the wrong completion
queue when an RC QP is placed in the error state.
Signed-off-by: Patrick Marchand Latifi <patrick.latifi@qlogic.com>
Acked-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch fixes the initialization of RC QPs, since we would rely on
the queue pair type (ibqp->qp_type) being set, but this field is only
initialized when we return from ipath_create_qp (it is initialized by
the user-level verbs library).
The fix is to not depend on this field to initialize the send and
the receive state of the RC QP.
Signed-off-by: Patrick Marchand Latifi <patrick.latifi@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
There can be a case where the requester's rnr retry counter
(s_rnr_retry) is less than the number of rnr retries allowed per QP
(s_rnr_retry_cnt). This can happen if the s_rnr_retry counter is being
decremented and an ipath_query_qp call is issued during that time frame.
The fix is to always return the number of rnr retries allowed per QP
instead of the requester's rnr counter.
Found by code review.
Signed-off-by: Patrick Marchand Latifi <patrick.latifi@qlogic.com>
Acked-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Subnet manager SetPortinfo messages distingush between changing the link
state (DOWN, ARM, ACTIVE) and the link physical state (POLL, SLEEP,
DISABLED). These are somewhat independent commands and affect when link
width and speed changes take effect. Without this patch, a link DOWN
physical state NOP command was causing the link width and speed settings
to take effect which should only happen when the link physical state is
goes down (either by a SMP or some link physical error like link errors
exceeding the threshold).
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The cxbg3 driver is unnecessarily decreasing the number of CQ entries by
one when creating a CQ. This will cause the CQ not to have as many
entries as requested by the user if the user requests a power of 2 size.
Signed-off-by: Jon Mason <jon@opengridcomputing.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Set cap.max_inline_data to the actual max inline data that the adapter
support, so that userspace apps see the right value returned.
Signed-off-by: Jon Mason <jon@opengridcomputing.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Interrupt moderation low threshold value was incorrectly triggering,
indicating that the threshold should be lowered.
The impact was the timer was likely to become 40usecs and get stuck
there. The biggest side effect was too many interrupts and nonoptimal
performance.
Signed-off-by: John Lacombe <jlacombe@neteffect.com>
Signed-off-by: Glenn Streiff <gstreiff@neteffect.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
With commit ef19454b ("[LIB] crc32c: Keep intermediate crc state in
cpu order"), the behavior of crc32c changes on big-endian platforms.
Our algorithm expects the previous behavior; otherwise we have RDMA
connection establishment failure on big-endian platforms like powerpc.
Apply cpu_to_le32() to value returned by crc32c() to get the previous
behavior.
Signed-off-by: Faisal Latif <flatif@neteffect.com>
Signed-off-by: Glenn Streiff <gstreiff@neteffect.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Just delete the debugging statement so we don't use cqp_request after
freeing it. Adrian Bunk flagged this use-after-free issue spotted by
the Coverity checker.
Signed-off-by: Glenn Streiff <gstreiff@neteffect.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix a check-after-use spotted by the Coverity checker.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Glenn Streiff <gstreiff@neteffect.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix a memory leak spotted by the Coverity checker.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Glenn Streiff <gstreiff@neteffect.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix an off-by-one spotted by the Coverity checker.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Glenn Streiff <gstreiff@neteffect.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Adrian Bunk pointed out that a Coverity scan found some apparently
dead code in nes_verbs.c that really shouldn't have been dead.
The function nes_create_cq() was missing the assignment
err = 1;
just prior to an iteration that conditionally set err = 0 if a PBL was
found for a given virtual CQ. I also noticed we should have been
returning -EFAULT on a couple related error paths.
Signed-off-by: Chien Tung <ctung@neteffect.com>
Signed-off-by: Glenn Streiff <gstreiff@neteffect.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
A single entry (addr 0x10001000, size 0x2000) will get converted to
page address 0x10000000 with a page size of 0x4000. The code as it
stands doesn't address the single buffer case, but in fact it allows
the subsequent single-buffer special case to be eliminated entirely.
Because the mask now includes the (page adjusted) starting and ending
addresses, the general case works for the single buffer case as well.
Signed-off-by: Bryan Rosenburg <rosnbrg@us.ibm.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When mthca_fmr_alloc() returns an error, it should free the MPT at the
index key, not mr->ibmr.lkey, since the lkey has been mangled by
hw_index_to_key() and no longer is the real index. This bug causes
corruption of the MPT table free bitmap when mthca_fmr_alloc() fails.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In nes_create_qp(), the test
if (nesqp->mmap_sq_db_index > NES_MAX_USER_WQ_REGIONS) {
is used to error out if the db_index is too large; however, if the
test doesn't trigger, then the index is used as
nes_ucontext->mmap_nesqp[nesqp->mmap_sq_db_index] = nesqp;
and mmap_nesqp is declared as
struct nes_qp *mmap_nesqp[NES_MAX_USER_WQ_REGIONS];
which leads to an array overrun if the index is exactly equal to
NES_MAX_USER_WQ_REGIONS. Fix this by bailing out if the index is
greater than or equal to NES_MAX_USER_WQ_REGIONS.
This was spotted by the Coverity checker (CID 2162).
Acked-by: Glenn Streiff <gstreiff@neteffect.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We need to account for the VLAN header size in nes_netdev_change_mtu()
and nes_netdev_init(). Also, add spin lock/unlock during VLAN RX
registration so only one process can assign VLAN group for a given
interface at a time.
Signed-off-by: Chien Tung <ctung@neteffect.com>
Signed-off-by: Glenn Streiff <gstreiff@neteffect.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Only mask out MAC interrupt if necessary and re-enable on ifup. There
could be multiple netdevs going through the same MAC. MAC interrupts
should not be masked off until the last netdev is downed.
Signed-off-by: Chien Tung <ctung@neteffect.com>
Signed-off-by: Glenn Streiff <gstreiff@neteffect.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Currently mlx4_ib_fmr_alloc() calls mlx4_mr_enable() instead of
mlx4_fmr_enable(). The two functions are equivalent at the moment, but
this is not really correct (and the change is needed to fix a bug).
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
replace:
big_endian_variable = cpu_to_beX(beX_to_cpu(big_endian_variable) +
expression_in_cpu_byteorder);
with:
beX_add_cpu(&big_endian_variable, expression_in_cpu_byteorder);
Generated with a semantic patch.
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The cxgb3 HW and driver don't support loopback RDMA connections. So
fail any connection attempt where the destination address is local.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Usually harmless, since the scatterlist is always hard-coded to a length
of 1, but it triggers a BUG() if CONFIG_DEBUG_SG=y, so we better fix it.
This fixes <http://bugzilla.kernel.org/show_bug.cgi?id=9934>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ConnectX HCA supports shrinking WQEs, so that a single work request
can be made of multiple units of wqe_shift. This way, WRs can differ
in size, and do not have to be a power of 2 in size, saving memory and
speeding up send WR posting. Unfortunately, if we do this then the
wqe_index field in CQEs can't be used to look up the WR ID anymore, so
our implementation does this only if selective signaling is off.
Further, on 32-bit platforms, we can't use vmap() to make the QP
buffer virtually contigious. Thus we have to use constant-sized WRs to
make sure a WR is always fully within a single page-sized chunk.
Finally, we use WRs with the NOP opcode to avoid wrapping around the
queue buffer in the middle of posting a WR, and we set the
NoErrorCompletion bit to avoid getting completions with error for NOP
WRs. However, NEC is only supported starting with firmware 2.2.232,
so we use constant-sized WRs for older firmware. And, since MLX QPs
only support SEND, we use constant-sized WRs in this case.
When stamping during NOP posting, do stamping following setting of the
NOP WQE valid bit.
Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We use struct mlx4_buf for kernel QP, CQ and SRQ buffers, and the code
to look up an entry is duplicated in get_cqe_from_buf() and the QP and
SRQ versions of get_wqe(). Factor this out into mlx4_buf_offset().
This will also make it easier to switch over to using vmap() for buffers.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a standard NIC and RDMA/iWARP driver for NetEffect 1/10Gb ethernet adapters.
Signed-off-by: Glenn Streiff <gstreiff@neteffect.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If the allocation of the MTT or the mailbox failed, mthca_fmr_alloc()
would return 0 (success) no matter what. This leads to crashes a
little down the road, when we try to dereference eg mr->mtt, which was
really ERR_PTR(-Ewhatever).
Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The string mlx4_ib_version was defined, but never used. Print out the
version once when the first device is initialized.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We have recently discovered that Tavor mode requires each WQE in a
posted list of receive WQEs to have a valid NDA field at all times.
This requirement holds true for regular QPs as well as for SRQs. This
patch prelinks the receive queue in a regular QP and keeps the free
list in SRQ always properly linked.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Reviewed-by: Jack Morgenstein <jackm@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The SRQ receive posting functions make sure that srq->first_free never
becomes negative, so we can remove tests of whether it is negative.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The firmware QUERY_ADAPTER command does not return vendor_id,
device_id, and revision_id; eliminate these fields from the query.
Initialize the rev_id field of the mlx4 device via init_node_data (MAD
IFC query), as is done in the query_device verb implementation.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
For memfree devices, the firmware QUERY_ADAPTER command does not
return vendor_id, device_id, and revision_id; do not return these
fields in the QUERY_ADAPTER function for memfree devices.
Instead, for memfree devices, initialize the rev_id field of the mthca
device via init_node_data (MAD IFC query), as is done in the
query_device verb implementation.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In mthca_reg_phys_mr(), we calculate the page size for the HCA
hardware to use to map the buffer list passed in by the consumer.
For example, if the consumer passes in
[0] addr 0x1000, size 0x1000
[1] addr 0x2000, size 0x1000
then the algorithm would come up with a page size of 0x2000 and a list
of two pages, at 0x0000 and 0x2000. Usually, this would work fine
since the memory region would start at an offset of 0x1000 and have a
length of 0x2000.
However, the old code did not take into account the alignment of the
IO virtual address passed in. For example, if the consumer passed in
a virtual address of 0x6000 for the above, then the offset of 0x1000
would not be used correctly because the page mask of 0x1fff would
result in an offset of 0.
We can fix this quite neatly by making sure that the page shift we use
is no bigger than the first bit where the start of the first buffer
and the IO virtual address differ. Also, we can further simplify the
code by removing the special case for a single buffer by noticing that
it doesn't matter if we use a page size that is too big. This allows
the loop to compute the page shift to be replaced with __ffs().
Thanks to Bryan S Rosenburg <rosnbrg@us.ibm.com> for pointing out the
original bug and suggesting several ways to improve this patch.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch enables ehca to redirect any PMA queries to the
actual PMA QP.
Signed-off-by: Hoang-Nam Nguyen <hnguyen@de.ibm.com>
Reviewed-by: Joachim Fenkes <fenkes@de.ibm.com>
Reviewed-by: Christoph Raisch <raisch@de.ibm.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>