Previously, vlan id and source MAC were used from QP attributes. Since
the net device is now stored in the GID attributes, they could be used
instead of getting this information from the QP attributes.
IB_QP_SMAC, IB_QP_ALT_SMAC, IB_QP_VID and IB_QP_ALT_VID were removed
because there is no known libibverbs that uses them.
This commit also modifies the vendors (mlx4, ocrdma) drivers in order
to use the new approach.
ocrdma driver changes were done by Somnath Kotur <Somnath.Kotur@Avagotech.Com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Adding an ability to query the IB cache by a netdev and get the
attributes of a GID. These parameters are necessary in order to
successfully resolve the required GID (when the netdevice is known)
and get the Ethernet L2 attributes from a GID.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-By: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
MLX4_IB_QP_BLOCK_MULTICAST_LOOPBACK is now supported downstream.
In addition, this flag was supported only for IB_QPT_UD, now, with the
new implementation it is supported for all QP types.
Support IB_USER_VERBS_EX_CMD_CREATE_QP in order to get the flag from
user space using the extension create qp command.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Current implementation for MLX4_IB_QP_BLOCK_MULTICAST_LOOPBACK is not
supported when link layer is Ethernet.
This patch will add counter based implementation for multicast loopback
prevention. HW can drop multicast loopback packets if sender QP counter
index is equal to receiver QP counter index. If qp flag
MLX4_IB_QP_BLOCK_MULTICAST_LOOPBACK is set and link layer is Ethernet,
create a new counter and attach it to the QP so it will continue
receiving multicast loopback traffic but it's own.
The decision if to create a new counter is being made at the qp
modification to RTR after the QP's port is set. When QP is destroyed or
moved back to reset state, delete the counter.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This is an infrastructure step for allocating and attaching more than
one counter to QPs on the same port. Allocate a counters table and
manage the insertion and removals of the counters in load and unload of
mlx4 IB.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This patch split up struct ib_send_wr so that all non-trivial verbs
use their own structure which embedds struct ib_send_wr. This dramaticly
shrinks the size of a WR for most common operations:
sizeof(struct ib_send_wr) (old): 96
sizeof(struct ib_send_wr): 48
sizeof(struct ib_rdma_wr): 64
sizeof(struct ib_atomic_wr): 96
sizeof(struct ib_ud_wr): 88
sizeof(struct ib_fast_reg_wr): 88
sizeof(struct ib_bind_mw_wr): 96
sizeof(struct ib_sig_handover_wr): 80
And with Sagi's pending MR rework the fast registration WR will also be
down to a reasonable size:
sizeof(struct ib_fastreg_wr): 64
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> [srp, srpt]
Reviewed-by: Chuck Lever <chuck.lever@oracle.com> [sunrpc]
Tested-by: Haggai Eran <haggaie@mellanox.com>
Tested-by: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
- Create drivers/staging/rdma
- Move amso1100 driver to staging/rdma and schedule for deletion
- Move ipath driver to staging/rdma and schedule for deletion
- Add hfi1 driver to staging/rdma and set TODO for move to regular tree
- Initial support for namespaces to be used on RDMA devices
- Add RoCE GID table handling to the RDMA core caching code
- Infrastructure to support handling of devices with differing
read and write scatter gather capabilities
- Various iSER updates
- Kill off unsafe usage of global mr registrations
- Update SRP driver
- Misc. mlx4 driver updates
- Support for the mr_alloc verb
- Support for a netlink interface between kernel and user space cache
daemon to speed path record queries and route resolution
- Ininitial support for safe hot removal of verbs devices
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJV7v8wAAoJELgmozMOVy/d2dcP/3PXnGFPgFGJODKE6VCZtTvj
nooNXRKXjxv470UT5DiAX7SNcBxzzS7Zl/Lj+831H9iNXUyzuH31KtBOAZ3W03vZ
yXwCB2caOStSldTRSUUvPe2aIFPnyNmSpC4i6XcJLJMCFijKmxin5pAo8qE44BQU
yjhT+wC9P6LL5wZXsn/nFIMLjOFfu0WBFHNp3gs5j59paxlx5VeIAZk16aQZH135
m7YCyicwrS8iyWQl2bEXRMon2vlCHlX2RHmOJ4f/P5I0quNcGF2+d8Yxa+K1VyC5
zcb3OBezz+wZtvh16yhsDfSPqHWirljwID2VzOgRSzTJWvQjju8VkwHtkq6bYoBW
egIxGCHcGWsD0R5iBXLYr/tB+BmjbDObSm0AsR4+JvSShkeVA1IpeoO+19162ixE
n6CQnk2jCee8KXeIN4PoIKsjRSbIECM0JliWPLoIpuTuEhhpajftlSLgL5hf1dzp
HrSy6fXmmoRj7wlTa7DnYIC3X+ffwckB8/t1zMAm2sKnIFUTjtQXF7upNiiyWk4L
/T1QEzJ2bLQckQ9yY4v528SvBQwA4Dy1amIQB7SU8+2S//bYdUvhysWPkdKC4oOT
WlqS5PFDCI31MvNbbM3rUbMAD8eBAR8ACw9ZpGI/Rffm5FEX5W3LoxA8gfEBRuqt
30ZYFuW8evTL+YQcaV65
=EHLg
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull inifiniband/rdma updates from Doug Ledford:
"This is a fairly sizeable set of changes. I've put them through a
decent amount of testing prior to sending the pull request due to
that.
There are still a few fixups that I know are coming, but I wanted to
go ahead and get the big, sizable chunk into your hands sooner rather
than waiting for those last few fixups.
Of note is the fact that this creates what is intended to be a
temporary area in the drivers/staging tree specifically for some
cleanups and additions that are coming for the RDMA stack. We
deprecated two drivers (ipath and amso1100) and are waiting to hear
back if we can deprecate another one (ehca). We also put Intel's new
hfi1 driver into this area because it needs to be refactored and a
transfer library created out of the factored out code, and then it and
the qib driver and the soft-roce driver should all be modified to use
that library.
I expect drivers/staging/rdma to be around for three or four kernel
releases and then to go away as all of the work is completed and final
deletions of deprecated drivers are done.
Summary of changes for 4.3:
- Create drivers/staging/rdma
- Move amso1100 driver to staging/rdma and schedule for deletion
- Move ipath driver to staging/rdma and schedule for deletion
- Add hfi1 driver to staging/rdma and set TODO for move to regular
tree
- Initial support for namespaces to be used on RDMA devices
- Add RoCE GID table handling to the RDMA core caching code
- Infrastructure to support handling of devices with differing read
and write scatter gather capabilities
- Various iSER updates
- Kill off unsafe usage of global mr registrations
- Update SRP driver
- Misc mlx4 driver updates
- Support for the mr_alloc verb
- Support for a netlink interface between kernel and user space cache
daemon to speed path record queries and route resolution
- Ininitial support for safe hot removal of verbs devices"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (136 commits)
IB/ipoib: Suppress warning for send only join failures
IB/ipoib: Clean up send-only multicast joins
IB/srp: Fix possible protection fault
IB/core: Move SM class defines from ib_mad.h to ib_smi.h
IB/core: Remove unnecessary defines from ib_mad.h
IB/hfi1: Add PSM2 user space header to header_install
IB/hfi1: Add CSRs for CONFIG_SDMA_VERBOSITY
mlx5: Fix incorrect wc pkey_index assignment for GSI messages
IB/mlx5: avoid destroying a NULL mr in reg_user_mr error flow
IB/uverbs: reject invalid or unknown opcodes
IB/cxgb4: Fix if statement in pick_local_ip6adddrs
IB/sa: Fix rdma netlink message flags
IB/ucma: HW Device hot-removal support
IB/mlx4_ib: Disassociate support
IB/uverbs: Enable device removal when there are active user space applications
IB/uverbs: Explicitly pass ib_dev to uverbs commands
IB/uverbs: Fix race between ib_uverbs_open and remove_one
IB/uverbs: Fix reference counting usage of event files
IB/core: Make ib_dealloc_pd return void
IB/srp: Create an insecure all physical rkey only if needed
...
Implements the IB core disassociate_ucontext API. The driver detaches the HW
resources for a given user context to prevent a dependency between application
termination and device disconnecting. This is done by managing the VMAs that
were mapped to the HW bars such as door bell and blueflame. When need to detach
remap them to an arbitrary kernel page returned by the zap API.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The pd now has a local_dma_lkey member which completely replaces
ib_get_dma_mr, use it instead.
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
When handling a device internal error, the driver is responsible to
drain the completion queue with flush errors.
In case a completion queue was assigned to multiple send queues, the
driver iterates over the send queues and generates flush errors of
inflight wqes. The driver must correctly pass the wc array with an
offset as a result of the previous send queue iteration. Not doing so
will overwrite previously set completions and return a wrong number
of polled completions which includes ones which were not correctly set.
Fixes: 35f05dabf9 (IB/mlx4: Reset flow support for IB kernel ULPs)
Signed-off-by: Ariel Nahum <arieln@mellanox.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Cc: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The mlx4 IB driver implementation for ib_query_ah used a wrong offset
(28 instead of 29) when link type is Ethernet. Fixed to use the correct one.
Fixes: fa417f7b52 ('IB/mlx4: Add support for IBoE')
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The pkey mapping for RoCE must remain the default mapping:
VFs:
virtual index 0 = mapped to real index 0 (0xFFFF)
All others indices: mapped to a real pkey index containing an
invalid pkey.
PF:
virtual index i = real index i.
Don't allow users to change these mappings using files found in
sysfs.
Fixes: c1e7e46612 ('IB/mlx4: Add iov directory in sysfs under the ib device')
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The mcg "too many pending requests" warning message fills the log
when OpenSM is downed. Demote the message from warning level to
debug level.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
send_mad_to_wire takes the same spinlock that is taken in
the interrupt context. Therefore, it needs irqsave/restore.
Fixes: b9c5d6a643 ('IB/mlx4: Add multicast group (MCG) paravirtualization for SR-IOV')
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Manage RoCE gid table with logic in IB/core, which is common to all
vendors, and remove the mechanism from the mlx4 IB driver.
Since management of the GID cache may lead to index mismatch with the
hardware GID table, a translation between indexes is required when
modifying a QP or creating an address handle.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
get_netdev: get the net_device on the physical port of the IB transport port. In
port aggregation mode it is required to return the netdev of the active port.
modify_gid: note for a change in the RoCE gid cache. Handle this by writing to
the harsware GID table. It is possible that indexes in cahce and hardware tables
won't match so a translation is required when modifying a QP or creating an
address handle.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Applications must not assume that max_sge and max_sge_rd are the same,
Hence expose max_sge_rd correctly as well.
Reported-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
To add Hardware accelerated support in 802.1ad vlan, replace
Current VLAN macros to CVLAN.
Replace:
MLX4_WQE_CTRL_INS_VLAN
MLX4_CQE_VLAN_PRESENT_MASK
With:
MLX4_WQE_CTRL_INS_CVLAN
MLX4_CQE_CVLAN_PRESENT_MASK
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is little chance our memory allocation will fail, so we can
combine initializing the work structs with allocating them instead of
looping through all of them once to allocate and again to initialize.
Then when we need to actually find out if our device is up or in the
process of going down, have all of our work structs batched up, take the
spin_lock once and only once, and do all of the batch under the one
spin_lock invocation instead of incurring all of the locked memory cycles
we would otherwise incur to take/release the spin_lock over and over
again.
Signed-off-by: Doug Ledford <dledford@redhat.com>
We create a number of work structs to be queued up to a workqueue, and
on completion of the workqueue handler, the workqueue handler frees the
allocated memory. If, however, we don't queue the work struct because
the device is going down, then we need to free the memory ourselves.
Signed-off-by: Doug Ledford <dledford@redhat.com>
On failure, we loop through all possible pointers and test them before
calling kfree. But really, why even attempt to free items we didn't
allocate when we can easily loop through exactly and only the devices
for which the original memory allocation succeeded and free just those.
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
For IB links, reading HCA flow counters through iboe_process_mad() should
be used when mlx4_ib_process_mad() is invoked only for VFs PMA queries and
exactly nothing else.
Fixes: 7193a141eb ('IB/mlx4: Set VF to read from QP counters')
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
mlx4 VFs can provide CQE raw time-stamping services, but they
don't have the hca core clock mapped to their PCI bars.
As such, we should not attempt to query and report the clock offset
to user space for VFs. Doing so causes query_device over VFs to fail
with -ENOSUPP.
Fixes: 4b664c4355 ('IB/mlx4: Add support for CQ time-stamping')
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
We recently added BUG_ON's which were inappropriate for a condition which
should never happen. Change these to be WARN_ON_ONCE as a debugging aid.
Fixes: 4cd7c9479a ('IB/mad: Add support for additional MAD info to/from drivers')
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Pull networking updates from David Miller:
1) Add TX fast path in mac80211, from Johannes Berg.
2) Add TSO/GRO support to ibmveth, from Thomas Falcon
3) Move away from cached routes in ipv6, just like ipv4, from Martin
KaFai Lau.
4) Lots of new rhashtable tests, from Thomas Graf.
5) Run ingress qdisc lockless, from Alexei Starovoitov.
6) Allow servers to fetch TCP packet headers for SYN packets of new
connections, for fingerprinting. From Eric Dumazet.
7) Add mode parameter to pktgen, for testing receive. From Alexei
Starovoitov.
8) Cache access optimizations via simplifications of build_skb(), from
Alexander Duyck.
9) Move page frag allocator under mm/, also from Alexander.
10) Add xmit_more support to hv_netvsc, from KY Srinivasan.
11) Add a counter guard in case we try to perform endless reclassify
loops in the packet scheduler.
12) Extern flow dissector to be programmable and use it in new "Flower"
classifier. From Jiri Pirko.
13) AF_PACKET fanout rollover fixes, performance improvements, and new
statistics. From Willem de Bruijn.
14) Add netdev driver for GENEVE tunnels, from John W Linville.
15) Add ingress netfilter hooks and filtering, from Pablo Neira Ayuso.
16) Fix handling of epoll edge triggers in TCP, from Eric Dumazet.
17) Add an ECN retry fallback for the initial TCP handshake, from Daniel
Borkmann.
18) Add tail call support to BPF, from Alexei Starovoitov.
19) Add several pktgen helper scripts, from Jesper Dangaard Brouer.
20) Add zerocopy support to AF_UNIX, from Hannes Frederic Sowa.
21) Favor even port numbers for allocation to connect() requests, and
odd port numbers for bind(0), in an effort to help avoid
ip_local_port_range exhaustion. From Eric Dumazet.
22) Add Cavium ThunderX driver, from Sunil Goutham.
23) Allow bpf programs to access skb_iif and dev->ifindex SKB metadata,
from Alexei Starovoitov.
24) Add support for T6 chips in cxgb4vf driver, from Hariprasad Shenai.
25) Double TCP Small Queues default to 256K to accomodate situations
like the XEN driver and wireless aggregation. From Wei Liu.
26) Add more entropy inputs to flow dissector, from Tom Herbert.
27) Add CDG congestion control algorithm to TCP, from Kenneth Klette
Jonassen.
28) Convert ipset over to RCU locking, from Jozsef Kadlecsik.
29) Track and act upon link status of ipv4 route nexthops, from Andy
Gospodarek.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1670 commits)
bridge: vlan: flush the dynamically learned entries on port vlan delete
bridge: multicast: add a comment to br_port_state_selection about blocking state
net: inet_diag: export IPV6_V6ONLY sockopt
stmmac: troubleshoot unexpected bits in des0 & des1
net: ipv4 sysctl option to ignore routes when nexthop link is down
net: track link-status of ipv4 nexthops
net: switchdev: ignore unsupported bridge flags
net: Cavium: Fix MAC address setting in shutdown state
drivers: net: xgene: fix for ACPI support without ACPI
ip: report the original address of ICMP messages
net/mlx5e: Prefetch skb data on RX
net/mlx5e: Pop cq outside mlx5e_get_cqe
net/mlx5e: Remove mlx5e_cq.sqrq back-pointer
net/mlx5e: Remove extra spaces
net/mlx5e: Avoid TX CQE generation if more xmit packets expected
net/mlx5e: Avoid redundant dev_kfree_skb() upon NOP completion
net/mlx5e: Remove re-assignment of wq type in mlx5e_enable_rq()
net/mlx5e: Use skb_shinfo(skb)->gso_segs rather than counting them
net/mlx5e: Static mapping of netdev priv resources to/from netdev TX queues
net/mlx4_en: Use HW counters for rx/tx bytes/packets in PF device
...
This is an infrastructure step for querying VF and PF counters.
This code was in the IB driver, move it to the mlx4 core driver
so it will be accessible for more use cases.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As IB VFs are not capable to read the port counters through MADs,
move there to read their own QP counters to gather statistics.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is an infrastructure step to attach all the QPs opened from the
IB driver to a counter in order to collect VF stats from the PF using
those counters.
If the port's type is Ethernet, the counter policy demands two counters
per port (one for RoCE and one for Ethernet). The port default counter
(allocated in mlx4_core) is used for the Ethernet netdev QPs and we
allocate another counter for RoCE.
If the port's traffic is Infiniband, the counter policy demands
one counter per port, so it can use the port's default counter.
Also, Add 'allocated' flag for each counter in order to clean it at
unload.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reserve the last valid counter index for "sink" counter, when a
new counter cannot be allocated, the driver will use this counter.
In order to avoid allocating this counter on any other flow, fix the
indices bitmap allocation range, and reserve the sink counter index.
Add macro for the sink counter index and replace all appearences of the
index with the macro.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to support alternate sized MADs (and variable sized MADs on OPA
devices) add in/out MAD size parameters to the process_mad core call.
In addition, add an out_mad_pkey_index to communicate the pkey index the driver
wishes the MAD stack to use when sending OPA MAD responses.
The out MAD size and the out MAD PKey index are required by the MAD
stack to generate responses on OPA devices.
Furthermore, the in and out MAD parameters are made generic by specifying them
as ib_mad_hdr rather than ib_mad.
Drivers are modified as needed and are protected by BUG_ON flags if the MAD
sizes passed to them is incorrect.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add max MAD size to the device immutable data set and have all drivers that
support MADs report the current IB MAD size (IB_MGMT_MAD_SIZE) to the core.
Verify MAD size data in both the MAD core and when reading the immutable data.
OPA drivers will report alternate MAD sizes in subsequent patches.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In preparation to support the new OPA MAD Base version, add a base version
parameter to ib_create_send_mad and set it to IB_MGMT_BASE_VERSION for current
users.
Definition of the new base version and it's processing will occur in later
patches.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This includes:
* support allocation of CQ with the TIMESTAMP_COMPLETION creation flag.
* add timestamp_mask and hca_core_clock to query_device, reporting the
number of supported timestamp bits (mask) and the hca_core_clock frequency.
* return hca core clock's offset in query_device vendor's data,
this is needed in order to read the HCA's core clock.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In order to read the HCA's cycle counter efficiently in
user space, we need to map the HCA's register.
This is done through mmap call.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Vendors should be able to pass vendor specific data to/from
user-space via query_device uverb. In order to do this,
we need to pass the vendors' specific udata.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Currently, ib_create_cq uses cqe and comp_vecotr instead
of the extendible ib_cq_init_attr struct.
Earlier patches already changed the vendors to work with
ib_cq_init_attr. This patch changes the consumers too.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Add a new ib_cq_init_attr structure which contains the
previous cqe (minimum number of CQ entries) and comp_vector
(completion vector) in addition to a new flags field.
All vendors' create_cq callbacks are changed in order
to work with the new API.
This commit does not change any functionality.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-By: Devesh Sharma <devesh.sharma@avagotech.com> to patch #2
Signed-off-by: Doug Ledford <dledford@redhat.com>
The process_mad device function declares some parameters as "in". Make those
parameters const and adjust the call tree under process_mad in the various
drivers accordingly.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Hal Rosenstock <hal@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The unwinding clean up code are err_create_flow starts at the current
index i. That means we shouldn't increment i until we're really sure
we won't have to destroy the current flow; otherwise we might
increment the index, fail inside an is_bonded block, and end up
accessing off the end of the reg_id[] array.
This was detected by Coverity (CID 1271229).
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Previously, mlx4_en allocated EQs and used them exclusively.
This affected RoCE performance, as applications which are
events sensitive were limited to use only the legacy EQs.
Change that by introducing an EQ pool. This pool is managed
by mlx4_core. EQs are assigned to ports (when there are limited
number of EQs, multiple ports could be assigned to the same EQs).
An exception to this rule is the ASYNC EQ which handles various events.
Legacy EQs are completely removed as all EQs could be shared.
When a consumer (mlx4_ib/mlx4_en) requests an EQ, it asks for
EQ serving on a specific port. The core driver calculates which
EQ should be assigned to that request.
Because IRQs are shared between IB and Ethernet modules, their
names only include the PCI device BDF address.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Ido Shamay <idos@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In SRIOV, when simple (i.e - Ethernet L2 only) flow steering rules are
created, always create them at MLX4_DOMAIN_NIC priority (instead of
the real priority the function created them at). This is done in order
to let multiple functions add broadcast/multicast rules without
affecting other functions, which is necessary for DPDK in SRIOV.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As part of enabling single ported VFs over IB ports we need to handle
some of the flows for generting EQ events for VFs which don't come
into play under Eth ports.
This mainly includes port management events derived from changes of the
phyiscal port (lid change, client re-register, down/up, etc), VF pkey table
changes and VF guid changes initiated by the IB driver.
(1) make sure that events are generated only for VFs sitting on
the relevant physical port (under the ALL_SLAVES flow).
(2) before generating the event, convert from physical (one or two)
to VF port (always equals one).
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
When multiplexling a MAD sent from VF, we should convert the port used
by the guest to send the packet to the actual physical port which will be
used to transmit the packet, before building the relevant address-handle (AH).
This is needed under VPI for single ported VFs, since the code that builds
the AH (mlx4_ib_query_ah()) makes decisions based on the input port. If we
use the port number provided by the guest, it might have different protocol
vs. the one this packat has to go from, and hence the result could be wrong.
So far, the conversion was done after the AH was built and it worked for
single ported Eth VFs which were not enabled under VPI. When adding support
for single ported IB VFs and VPI, we hit that.
Fixes: 449fc48866 ('net/mlx4: Adapt code for N-Port VF')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As of commit 5eb620c81c "IB/core: Add helpers for uncached GID and P_Key
searches"; pkey_tbl_len and gid_tbl_len are immutable data which are stored in
the ib_device.
The per port core capability flags to be added later are also immutable data to
be stored in the ib_device object.
In preparation for this create a structure for per port immutable data and
place the pkey and gid table lengths within this structure.
"get_port_immutable" is added as a mandatory device function to allow the
drivers to fill in this data.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
These KERN_<LEVEL> uses are unnecessary with pr_<level> and cause
bad logging output so remove them.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Since ib_dma_map_single can fail use ib_dma_mapping_error to check
for errors.
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Acked-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The current code decreases from the mss size (which is the gso_size
from the kernel skb) the size of the packet headers.
It shouldn't do that because the mss that comes from the stack
(e.g IPoIB) includes only the tcp payload without the headers.
The result is indication to the HW that each packet that the HW sends
is smaller than what it could be, and too many packets will be sent
for big messages.
An easy way to demonstrate one more aspect of the problem is by
configuring the ipoib mtu to be less than 2*hlen (2*56) and then
run app sending big TCP messages. This will tell the HW to send packets
with giant (negative value which under unsigned arithmetics becomes
a huge positive one) length and the QP moves to SQE state.
Fixes: b832be1e40 ('IB/mlx4: Add IPoIB LSO support')
Reported-by: Matthew Finlay <matt@mellanox.com>
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Change the default mode to be HOST assigned instead of SM assigned. This is
the expected operational mode, because it doesn't depend on SM availability.
As PF generates random GUIDs as the initial admin values, this gives
out of the box experience.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Request GIDs from the SM on demand, i.e., when a VF actually needs them,
and release them when the GIDs are no longer in use.
In cloud environments, this is useful for GID migrations, in which a
GID is assigned to a VF on the destination HCA, while the VF on the
source HCA is shutdown (but the GID was not administratively released).
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Change the init flow to ask GUIDs only for active VFs. This is done for
both SM & HOST modes so that there is no need any more to maintain the
ownership record type.
In case SM mode is used, the initial value will be 0, ask the SM to assign,
for the HOST mode the initial value will be the HOST generated GUID.
This will enable out of the box experience for both probed and attached VFs.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Set the admin alias GUID per the administrator's request via the sysfs
mechanism into the core layer.
The "get" request returns the current value. However, if the administrator
requests the SM to assign a new value by requesting 0, the SM assigned
GUID is returned.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
If the SM rejects an alias GUID request the PF driver keeps trying to acquire
the specified GUID indefinitely, utilizing an exponential backoff scheme.
Retrying is managed per GUID entry. Each entry that wasn't applied holds its
next retry information. Retry requests to the SM consist of records of 8
consecutive GUIDS. Each record that contains GUIDs requiring retries holds its
next time-to-run based on the retry information of all its GUID entries. The
record having the lowest retry time will run first when that retry time
arrives.
Since the method (SET or DELETE) as sent to the SM applies to all the GUIDs in
the record, we must handle SET requests and DELETE requests in separate SM
messages (one for SETs and the other for DELETEs).
To avoid race conditions where a GUID entry request (set or delete) was
modified after the SM request was sent, we save the method and the requested
indices as part of the callback's context -- thus, only the requested indexes
are evaluated when the response is received.
When an GUID entry is approved we turn off its retry-required bit, this
prevents redundant SM retries from occurring on that record.
The port down event should be sent only when previously it was up. Likewise,
the port up event should be sent only if previously the port was down.
Synchronization was added around the flows that change entries and record state
to prevent race conditions.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
The calls to SET_PORT used hard-code numbers, when supplying command's
opcode modifiers, fix that to use well defined constants.
Signed-off-by: Ido Shamay <idos@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For RoCE ports, we set the u32 PMA values based on u64 HCA counters. In case of
overflow, according to the IB spec, we have to saturate a counter to its
max value, do that.
Fixes: c37791349c ('IB/mlx4: Support PMA counters for IBoE')
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Processing an event is done in a different context from the one when
the event was dispatched. This requires a check that the slave
net device is still valid when the event is being processed. The check is done
under the iboe lock which ensure correctness.
Fixes: a575009030 ('IB/mlx4: Add port aggregation support')
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Re-enable on-demand paging changes with stable ABI
- Fairly large set of ocrdma HW driver fixes
- Some qib HW driver fixes
- Other miscellaneous changes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJU52nHAAoJEENa44ZhAt0hkycP/0vwYNl0JJadasSUrLm2AWje
iAePU+K7uiyxSuIU+eLnyyDV/iQ2sCjXfhQNE6FnFhH3JjwBYbhS2Q8WcjwfRWYX
iFhORQltb3spSeLdT7N1QfkMF4MtAw3ENPE3QKIgNaIQSso+J256BHrSiqr9Akwe
hwIVr0TLDO59ggPs3c083uUhC+5AMViMVgRR+N9+/59WHz2vG2WsTunpg1sYOAgC
KpUhfZW4za5xYJ0c8BOnPiSAfJasD1UdDg3oX0RPD4j/diiWAO6EOR+jkOLtODpd
8uhD8HM7ZvCKE1io5QnVdjTR/n51NaVGHk+yfWJRSvV1sw1AALtU4NVniI+E5mFs
Fwe+zUzbQ3j08TtrU0VjaeteBh2oGC0pJlkJS9+HXeDmH30/LQCMCiA5mBSSEPDg
0cPEAJgGAZqOIyyZe8jSslW/iN0cE6FDDb8+/1AZ80IfbdMxXh9FVwi7cdXn8+OQ
nXmtnSa7yzKWS9VVXDrvSzI0Y2oDv8DSDpCWGCm7bUcDXd/T5LpPR3RdSorGkYAr
O2zmuJZlCdkuKgW91O7cztNj2hTlDnJ+e+5P05KwPOb86Muum6tphLJd5b1h970X
Actx/6EX/ycSQ3ukiKW7Ksn2bYD3RLZvdbPYQM5xUjlGGWPF6nvmbZ8MqnyaLFfE
WKvx39DMGq3ktofiMkbf
=JLsF
-----END PGP SIGNATURE-----
Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull InfiniBand/RDMA updates from Roland Dreier:
- Re-enable on-demand paging changes with stable ABI
- Fairly large set of ocrdma HW driver fixes
- Some qib HW driver fixes
- Other miscellaneous changes
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (43 commits)
IB/qib: Add blank line after declaration
IB/qib: Fix checkpatch warnings
IB/mlx5: Enable the ODP capability query verb
IB/core: Add on demand paging caps to ib_uverbs_ex_query_device
IB/core: Add support for extended query device caps
RDMA/cxgb4: Don't hang threads forever waiting on WR replies
RDMA/ocrdma: Fix off by one in ocrdma_query_gid()
RDMA/ocrdma: Use unsigned for bit index
RDMA/ocrdma: Help gcc generate better code for ocrdma_srq_toggle_bit
RDMA/ocrdma: Update the ocrdma module version string
RDMA/ocrdma: set vlan present bit for user AH
RDMA/ocrdma: remove reference of ocrdma_dev out of ocrdma_qp structure
RDMA/ocrdma: Add support for interrupt moderation
RDMA/ocrdma: Honor return value of ocrdma_resolve_dmac
RDMA/ocrdma: Allow expansion of the SQ CQEs via buddy CQ expansion of the QP
RDMA/ocrdma: Discontinue support of RDMA-READ-WITH-INVALIDATE
RDMA/ocrdma: Host crash on destroying device resources
RDMA/ocrdma: Report correct state in ibv_query_qp
RDMA/ocrdma: Debugfs enhancments for ocrdma driver
RDMA/ocrdma: Report correct count of interrupt vectors while registering ocrdma device
...
If a GUID is not found, the 64-bit GUID printed in the message log
warning should converted to host-endian order for printing.
Found by Doug Ledford and Hal Rosenstock. Fix suggested by Hal.
Signed-off-by: Hal Rosenstock <hal@dev.mellanox.co.il>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
1. Before the entries alignment, we need to check that the entries
doesn't exceed the device's max cqe.
2. After the alignment, we need to make sure that the aligned number
doesn't exceed the max cqes+1. The additional cqe is used to denote
that the resizing operation has completed.
3. If the users asks to resize the CQ with entries less than the
oustanding cqes we should fail instead of returning 0.
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
In case handle_eth_ud_smac_index fails, we need to free the allocated resources.
Fixes: 2f5bb47368 ("mlx4: Add ref counting to port MAC table for RoCE")
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The MLX4_PROT_IB_IPV4 protocol should only be used with RoCEv2 and such.
Removing this wrong usage allows to run multicast applications over RoCE.
Fixes: d487ee7774 ("IB/mlx4: Use IBoE (RoCE) IP based GIDs in the port GID table")
Reported-by: Carol Soto <clsoto@linux.vnet.ibm.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The driver exposes interfaces that directly relate to HW state. Upon fatal
error, consumers of these interfaces (ULPs) that rely on completion of
all their posted work-request could hang, thereby introducing dependencies
in shutdown order. To prevent this from happening, we manage the
relevant resources (CQs, QPs) that are used by the device. Upon a fatal error,
we now generate simulated completions for outstanding WQEs that were not
completed at the time the HW was reset.
It includes invoking the completion event handler for all involved CQs so that
the ULPs will poll those CQs. When polled we return simulated CQEs with
IB_WC_WR_FLUSH_ERR return code enabling ULPs to clean up their resources and
not wait forever for completions upon receiving remove_one.
The above change requires an extra check in the data path to make sure that when
device is in error state, the simulated CQEs will be returned and no further
WQEs will be posted.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When attaching a QP to a multicast address in bonded mode, there was an
assumption that the port of the QP must be #1. This assumption isn't the
case under the flow which enables maximal usage of the physical ports.
Fix it by always checking the port of the original flow and create the
mirrored flow on the other port.
Fixes: c6215745b6 ('IB/mlx4: Load balance ports in port aggregation mode')
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the mlx4 IB (RoCE) device works in link aggregation mode, it
exposes a single port to upper layers. Therefore, applications always
set '1' in port_num attribute when modifying a QP or creating an address handle.
To make sure that a node uses all available ports the mlx4 driver will
override the port_num attribute with a round robin policy.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In port aggregation mode flows for port #1 (the only port) should be mirrored
on port #2. This is because packets can arrive from either physical ports.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Register the interface with the mlx4 core driver with port aggregation support
and check for port aggregation mode when the 'add' function is called.
In this mode, only one physical port is reported to the upper layer
(RoCE/IB core stack and ULPs).
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function is implemented twice... get rid of one copy.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
arch/arm/boot/dts/imx6sx-sdb.dts
net/sched/cls_bpf.c
Two simple sets of overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Maintain a persistent memory that should survive reset flow/PCI error.
This comes as a preparation for coming series to support above flows.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Except for VXLAN steering rules, all offloads should work as they were
under plain DMFS mode. Fix that by enabling all the offloads under
DMFS-A0 mode, except for VXLAN steering rules.
Fixes: d57febe1a4 "net/mlx4: Add A0 hybrid steering"
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This error was detected by sparse static checker:
drivers/infiniband/hw/mlx4/mr.c:226:21: warning: symbol 'err' shadows an earlier one
drivers/infiniband/hw/mlx4/mr.c:197:13: originally declared here
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
A0 hybrid steering is a form of high performance flow steering.
By using this mode, mlx4 cards use a fast limited table based steering,
in order to enable fast steering of unicast packets to a QP.
In order to implement A0 hybrid steering we allocate resources
from different zones:
(1) General range
(2) Special MAC-assigned QPs [RSS, Raw-Ethernet] each has its own region.
When we create a rss QP or a raw ethernet (A0 steerable and BF ready) QP,
we try hard to allocate the QP from range (2). Otherwise, we try hard not
to allocate from this range. However, when the system is pushed to its
limits and one needs every resource, the allocator uses every region it can.
Meaning, when we run out of raw-eth qps, the allocator allocates from the
general range (and the special-A0 area is no longer active). If we run out
of RSS qps, the mechanism tries to allocate from the raw-eth QP zone. If that
is also exhausted, the allocator will allocate from the general range
(and the A0 region is no longer active).
Note that if a raw-eth qp is allocated from the general range, it attempts
to allocate the range such that bits 6 and 7 (blueflame bits) in the
QP number are not set.
When the feature is used in SRIOV, the VF has to notify the PF what
kind of QP attributes it needs. In order to do that, along with the
"Eth QP blueflame" bit, we reserve a new "A0 steerable QP". According
to the combination of these bits, the PF tries to allocate a suitable QP.
In order to maintain backward compatibility (with older PFs), the PF
notifies which QP attributes it supports via QUERY_FUNC_CAP command.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using BF (Blue-Flame), the QPN overrides the VLAN, CV, and SV fields
in the WQE. Thus, BF may only be used for QPNs with bits 6,7 unset.
The current Ethernet driver code reserves a Tx QP range with 256b alignment.
This is wrong because if there are more than 64 Tx QPs in use,
QPNs >= base + 65 will have bits 6/7 set.
This problem is not specific for the Ethernet driver, any entity that
tries to reserve more than 64 BF-enabled QPs should fail. Also, using
ranges is not necessary here and is wasteful.
The new mechanism introduced here will support reservation for
"Eth QPs eligible for BF" for all drivers: bare-metal, multi-PF, and VFs
(when hypervisors support WC in VMs). The flow we use is:
1. In mlx4_en, allocate Tx QPs one by one instead of a range allocation,
and request "BF enabled QPs" if BF is supported for the function
2. In the ALLOC_RES FW command, change param1 to:
a. param1[23:0] - number of QPs
b. param1[31-24] - flags controlling QPs reservation
Bit 31 refers to Eth blueflame supported QPs. Those QPs must have
bits 6 and 7 unset in order to be used in Ethernet.
Bits 24-30 of the flags are currently reserved.
When a function tries to allocate a QP, it states the required attributes
for this QP. Those attributes are considered "best-effort". If an attribute,
such as Ethernet BF enabled QP, is a must-have attribute, the function has
to check that attribute is supported before trying to do the allocation.
In a lower layer of the code, mlx4_qp_reserve_range masks out the bits
which are unsupported. If SRIOV is used, the PF validates those attributes
and masks out unsupported attributes as well. In order to notify VFs which
attributes are supported, the VF uses QUERY_FUNC_CAP command. This command's
mailbox is filled by the PF, which notifies which QP allocation attributes
it supports.
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.co.il>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously, we've fired all our completion callbacks straight from our ISR.
Some of those callbacks were lightweight (for example, mlx4_en's and
IPoIB napi callbacks), but some of them did more work (for example,
the user-space RDMA stack uverbs' completion handler). Besides that,
doing more than the minimal work in ISR is generally considered wrong,
it could even lead to a hard lockup of the system. Since when a lot
of completion events are generated by the hardware, the loop over those
events could be so long, that we'll get into a hard lockup by the system
watchdog.
In order to avoid that, add a new way of invoking completion events
callbacks. In the interrupt itself, we add the CQs which receive completion
event to a per-EQ list and schedule a tasklet. In the tasklet context
we loop over all the CQs in the list and invoke the user callback.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously, the driver queried the firmware in order to get the number
of supported EQs. Under SRIOV, since this was done before the driver
notified the firmware how many VFs it actually needs, the firmware had
to take into account a worst case scenario and always allocated four EQs
per VF, where one was used for events while the others were used for completions.
Now, when the firmware supports the asymmetric allocation scheme, denoted
by exposing num_sys_eqs > 0 (--> MLX4_DEV_CAP_FLAG2_SYS_EQS), we use the
QUERY_FUNC command to query the firmware before enabling SRIOV. Thus we
can get more EQs and MSI-X vectors per function.
Moreover, when running in the new firmware/driver mode, the limitation
that the number of EQs should be a power of two is lifted.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If mlx4_ib_create_flow() attempts to create > 1 rules with the
firmware, and one of these registrations fail, we leaked the
already created flow rules.
One example of the leak is when the registration of the VXLAN ghost
steering rule fails, we didn't unregister the original rule requested
by the user, introduced in commit d2fce8a906 "mlx4: Set
user-space raw Ethernet QPs to properly handle VXLAN traffic".
While here, add dump of the VXLAN portion of steering rules
so it can actually be seen when flow creation fails.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Fixes for the new memory region re-registration support
- iSER initiator error path fixes
- Grab bag of small fixes for the qib and ocrdma hardware drivers
- Larger set of fixes for mlx4, especially in RoCE mode
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJUIexdAAoJEENa44ZhAt0hP10QAJztxlS2a8U3JCJzthwSYxlI
ohT9487iLk1uEcj4Z3i7w2ERRUzXaHbRTktNHFjwfRb8x2qMUgT2PfD6/30sQ250
nJAk3FRFNipxKkJSfmcc3+O4r91i4F+CaN8DGypaBDHcupeD2drKocl/Iu5MIvkG
e5CzLlS7i/xrWKmgYP4bIqqFZsqQ+2rJrYBDybuLZSaZNd0PTDE3yCDihfOcsxjn
TeOCVbm5895fPRtxzeCGHy8bXbYYN9vItuhtHC+sntYtbhNJhjpmP+1yD6M2SoZR
34sGd7AA1j1H6ATmanzeW2aALkFYPIuGihDbbnRQlDG1v09lEPfP2GtfLxoQ9Ibo
nfe2rsthzV6Qh2xcXjn6KicgV7bb6aSUXEK24zKx7O3MkOvHkOC/JIIrd9dFe+uj
R7pUd3XlAk8SBhTQ4gLub06Dl7ynzSRArwcdMTHp30LvtnjJZoQR67WGGrsdwlIW
MV43105i7iLCcdaSd0ihKnR6OFlSh13Z0wpu+B386bwxkHxjFJXkVHxOJir/iAk9
cW4RXbA/ic7nwIjes4GbMNDOvdJO2tDcg9KGSgiDY3kC5GksPqfxXYVDlMB2rFoE
PhfQ8TOcbZYTmlcKLMpMIFXP484VPhWQJeYWPOf9KGS6aW5QRNPsPCmAvaoSXWLs
GVSlvjbE6O7MgonqG1Jh
=Kpm1
-----END PGP SIGNATURE-----
Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull infiniband/rdma fixes from Roland Dreier:
"Last late set of InfiniBand/RDMA fixes for 3.17:
- fixes for the new memory region re-registration support
- iSER initiator error path fixes
- grab bag of small fixes for the qib and ocrdma hardware drivers
- larger set of fixes for mlx4, especially in RoCE mode"
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (26 commits)
IB/mlx4: Fix VF mac handling in RoCE
IB/mlx4: Do not allow APM under RoCE
IB/mlx4: Don't update QP1 in native mode
IB/mlx4: Avoid accessing netdevice when building RoCE qp1 header
mlx4: Fix mlx4 reg/unreg mac to work properly with 0-mac addresses
IB/core: When marshaling uverbs path, clear unused fields
IB/mlx4: Avoid executing gid task when device is being removed
IB/mlx4: Fix lockdep splat for the iboe lock
IB/mlx4: Get upper dev addresses as RoCE GIDs when port comes up
IB/mlx4: Reorder steps in RoCE GID table initialization
IB/mlx4: Don't duplicate the default RoCE GID
IB/mlx4: Avoid null pointer dereference in mlx4_ib_scan_netdevs()
IB/iser: Bump version to 1.4.1
IB/iser: Allow bind only when connection state is UP
IB/iser: Fix RX/TX CQ resource leak on error flow
RDMA/ocrdma: Use right macro in query AH
RDMA/ocrdma: Resolve L2 address when creating user AH
mlx4: Correct error flows in rereg_mr
IB/qib: Correct reference counting in debugfs qp_stats
IPoIB: Remove unnecessary port query
...
We had several problems here. First, a race condition on QP1 mac
handling between mlx4_ib_update_qps and mlx4_ib_modify_qp, which is
fixed by taking the qp mutex in mlx4_ib_update_qps.
Also, qp->pri.smac_port was not updated in mlx4_ib_update_qps.
Last, in __mlx4_ib_modify_qp we did not properly handle the case where
the mac is zero, but port is non-zero.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Automatic Path Migration is not supported under RoCE. Therefore,
return a "not-supported" error if the caller attempts to set an
alternate path in a QP context.
In addition, if there are no IB ports configured, do not report
APM capability in the device flags returned by mlx4_ib_query_device.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
For native functions (non-SR-IOV), there's no reason to update
the smac_index, as QP1 is a GSI QP.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The source MAC is needed in RoCE when building the QP1 header.
Currently, this is obtained from the source net device. However, the net
device may not yet exist, or can be destroyed in parallel to this QP1 send
operation (e.g through the VPI port change flow) so accessing it may cause
a kernel crash.
To fix this, we maintain a source MAC cache per port for the net device in
struct mlx4_ib_roce. This cached MAC is initialized to be the default MAC
address obtained during HCA initialization via QUERY_PORT. This cached MAC
is updated via the netdev event notifier handler.
Since the cached MAC is held in an atomic64 object, we do not need locking
when accessing it.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When device is being removed (e.g during VPI port link type change
from ETH to IB), tasks for gid table changes should not be executed.
Flush the current queue of tasks and block further tasks from entering the queue.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Chuck Lever reported the following stack trace:
=================================
[ INFO: inconsistent lock state ]
3.16.0-rc2-00024-g2e78883 #17 Tainted: G E
---------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
swapper/0/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
(&(&iboe->lock)->rlock){+.?...}, at: [<ffffffffa065f68b>] mlx4_ib_addr_event+0xdb/0x1a0 [mlx4_ib]
{SOFTIRQ-ON-W} state was registered at:
[<ffffffff810b3110>] mark_irqflags+0x110/0x170
[<ffffffff810b4806>] __lock_acquire+0x2c6/0x5b0
[<ffffffff810b4bd9>] lock_acquire+0xe9/0x120
[<ffffffff815f7f6e>] _raw_spin_lock+0x3e/0x80
[<ffffffffa0661084>] mlx4_ib_scan_netdevs+0x34/0x260 [mlx4_ib]
[<ffffffffa06612db>] mlx4_ib_netdev_event+0x2b/0x40 [mlx4_ib]
[<ffffffff81522219>] register_netdevice_notifier+0x99/0x1e0
[<ffffffffa06626e3>] mlx4_ib_add+0x743/0xbc0 [mlx4_ib]
[<ffffffffa05ec168>] mlx4_add_device+0x48/0xa0 [mlx4_core]
[<ffffffffa05ec2c3>] mlx4_register_interface+0x73/0xb0 [mlx4_core]
[<ffffffffa05c505e>] cm_req_handler+0x13e/0x460 [ib_cm]
[<ffffffff810002e2>] do_one_initcall+0x112/0x1c0
[<ffffffff810e8264>] do_init_module+0x34/0x190
[<ffffffff810ea62f>] load_module+0x5cf/0x740
[<ffffffff810ea939>] SyS_init_module+0x99/0xd0
[<ffffffff815f8fd2>] system_call_fastpath+0x16/0x1b
irq event stamp: 336142
hardirqs last enabled at (336142): [<ffffffff810612f5>] __local_bh_enable_ip+0xb5/0xc0
hardirqs last disabled at (336141): [<ffffffff81061296>] __local_bh_enable_ip+0x56/0xc0
softirqs last enabled at (336004): [<ffffffff8106123a>] _local_bh_enable+0x4a/0x50
softirqs last disabled at (336005): [<ffffffff810617a4>] irq_exit+0x44/0xd0
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&(&iboe->lock)->rlock);
<Interrupt>
lock(&(&iboe->lock)->rlock);
*** DEADLOCK ***
The above problem was caused by the spin lock being taken both in the process
context and in a soft-irq context (in a netdev notifier handler).
The required fix is to use spin_lock/unlock_bh() instead of spin_lock/unlock
on the iboe lock.
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When a RoCE port becomes active and the netdev of the port has upper
device (e.g bond/team), GIDs derived from the upper dev should appear
in the port's RoCE GID table.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
There's no need to reset the gid table twice and we need to do it only
for Ethernet ports. Also, no need to actively scan ndetdevs since it's
being done immediatly after we register netdev notifiers.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When reading the IPv6 addresses from the net-device, make sure to
avoid adding a duplicate entry to the GID table because of equality
between the default GID we generate and the default IPv6 link-local
address of the device.
Fixes: acc4fccf4e ("IB/mlx4: Make sure GID index 0 is always occupied")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When Ethernet netdev is not present for a port (e.g. when the link
layer type of the port is InfiniBand) it's possible to dereference a
null pointer when we do netdevice scanning.
To fix that, we move a section of code that needs to run only when
netdev is present to a proper if () statement.
Fixes: ad4885d279 ("IB/mlx4: Build the port IBoE GID table properly under bonding")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch addresses feedback from Sagi Grimberg on the rereg_mr
implementation of mlx4. The following are fixed:
1. Set the correct pd_flags
2. Make sure we change the iova and size MR fields only after
successful write and allocation of the MTTs.
3. Make the error checking more robust
Fixes: e630664c83 ("mlx4_core: Add helper functions to support MR re-registration")
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
According to <http://marc.info/?t=138347640900004&r=1&w=2>, revision
A0 of Connect-X does not correctly assemble TSO packets. Disable that
feature on that hardware revision.
Signed-off-by: Roland Dreier <roland@purestorage.com>
We wrongly tested QP context bits without BE conversion
as was spotted by sparse...
drivers/infiniband/hw/mlx4/qp.c:1685:38: sparse: restricted __be32 degrades to integer
Fix that!
Fixes: d2fce8a ('mlx4: Set user-space raw Ethernet QPs to properly handle VXLAN traffic')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changing the vlan stripping policy of the QP isn't supported by older
firmware versions for the INIT2RTR command. Nevertheless, we've used it.
Fix that by doing this policy change using INIT2RTR only if the firmware
supports it, otherwise, we call UPDATE_QP command to do the task.
Fixes: 7677fc9 ('net/mlx4: Strengthen VLAN tags/priorities enforcement in VST mode')
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Raw Ethernet QPs opened from user-space lack the proper setup to
recieve/handle VXLAN traffic when VXLAN offloads are enabled.
Fix that by adding a tunnel steering rule on top of the normal unicast
steering rule and set the tunnel_type field in the QP context.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Registrations options are specified through flags. Definitions of flags will
be in subsequent patches.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This enables the user to change the protection domain, access flags
and translation (address and length) of the MR.
Use basic mlx4_core helper functions to get, update and set MPT and
MTT objects according to the required modifications.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
- Add iWARP port mapper to avoid conflicts between RDMA and normal
stack TCP connections.
- Fixes for i386 / x86-64 structure padding differences (ABI
compatibility for 32-on-64) from Yann Droneaud.
- A pile of SRP initiator fixes from Bart Van Assche.
- Fixes for a writeback / memory allocation deadlock with NFS over
IPoIB connected mode from Jiri Kosina.
- The usual fixes and cleanups to mlx4, mlx5, cxgb4 and other
low-level drivers.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJTlzyEAAoJEENa44ZhAt0h9yoP/1UeXlejOpCJyiNdtJZ+ilcU
cb0PEzsjzqACyDqcoQ0EpQM3/3emccVIC3uUXK12mzlTIXOFYTeRLays/TbxZDLt
FK5D/NrMmmJmciPt1ZRgUX82kFFRGScEfpkXYs7jxtRaNT7CW5KwSNQr6aFXskUz
1gpdK1ARCN5rWcGl2HJx5o9C4c/Fa/Vov8lOsAkUZXD1SuPNT/fFN0u1pRzU68g0
k3oj81XnZq5ejOBQKXEHImcmjXwaJ2yjmzxhSsKebqDWDdXuS/F9e4taKneHTZmr
AdwJaLLJPWmAGi/vYYhkuLKpzIDpzMCqwr39lEabmjWvznYOlnjfVUXwUTE2nwNC
DIXuHOLFrSvF2cNxh8ZeEYKS8AV+PjAOahPC5whkWkY256Q67uB7cy9ilWAK+7xS
QcQ5Inr6iXvxIGYA4hNwUo8aK0NuKFwhkVVFEbkPaurbQZPqiKwyVE3w2FOws/Qp
0kLLCVvpRQYjKzkxyof2tb1AcNuVNKXHrYk6RaBDJ9mjxHbhvY4OSt4CBxAAXBu6
zoedUydN1Nz1UgAB1jDsBdyE2QQnXockA1+JJKNq6gM5Dz0DUdAylzQ2NqY9tnYz
RTzihEPYIiQUkV3B8ErbqsuO6z7M830AXO5AR6bLZn1zgJ0cbMLBaKLA8LRufJI/
qxNVwL32Uv1PjKZ+yX1x
=Wcdc
-----END PGP SIGNATURE-----
Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull main InfiniBand/RDMA updates from Roland Dreier:
- add iWARP port mapper to avoid conflicts between RDMA and normal
stack TCP connections.
- fixes for i386 / x86-64 structure padding differences (ABI
compatibility for 32-on-64) from Yann Droneaud.
- a pile of SRP initiator fixes from Bart Van Assche.
- fixes for a writeback / memory allocation deadlock with NFS over
IPoIB connected mode from Jiri Kosina.
- the usual fixes and cleanups to mlx4, mlx5, cxgb4 and other low-level
drivers.
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (61 commits)
RDMA/cxgb4: Add support for iWARP Port Mapper user space service
RDMA/nes: Add support for iWARP Port Mapper user space service
RDMA/core: Add support for iWARP Port Mapper user space service
IB/mlx4: Fix gfp passing in create_qp_common()
IB/umad: Fix use-after-free on close
IB/core: Fix kobject leak on device register error flow
RDMA/cxgb4: add missing padding at end of struct c4iw_alloc_ucontext_resp
mlx4_core: Fix GFP flags parameters to be gfp_t
IB/core: Fix port kobject deletion during error flow
IB/core: Remove unneeded kobject_get/put calls
IB/core: Fix sparse warnings about redeclared functions
IB/mad: Fix sparse warning about gfp_t use
IB/mlx4: Implement IB_QP_CREATE_USE_GFP_NOIO
IB: Add a QP creation flag to use GFP_NOIO allocations
IB: Return error for unsupported QP creation flags
IB: Allow build of hw/ and ulp/ subdirectories independently
mlx4_core: Move handling of MLX4_QP_ST_MLX to proper switch statement
RDMA/cxgb4: Add missing padding at end of struct c4iw_create_cq_resp
IB/srp: Avoid problems if a header uses pr_fmt
IB/umad: Fix error handling
...
There are two kzalloc() calls which were not converted to use value of
gfp passed to create_qp_common() instead of using hardcoded GFP_KERNEL
in 40f2287bd5 ("IB/mlx4: Implement IB_QP_CREATE_USE_GFP_NOIO"). Fix
this by passing gfp value down properly.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Modify the various routines used to allocate memory resources which
serve QPs in mlx4 to get an input GFP directive. Have the Ethernet
driver to use GFP_KERNEL in it's QP allocations as done prior to this
commit, and the IB driver to use GFP_NOIO when the IB verbs
IB_QP_CREATE_USE_GFP_NOIO QP creation flag is provided.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This commit adds the sysfs interface for enabling QP0 on VFs for
selected VF/port.
By default, no VFs are enabled for QP0 operation.
To enable QP0 operation on a VF/port, under
/sys/class/infiniband/mlx4_x/iov/<b:d:f>/ports/x there are two new entries:
- smi_enabled (read-only). Indicates whether smi is currently
enabled for the indicated VF/port
- enable_smi_admin (rw). Used by the admin to request that smi
capability be enabled or disabled for the indicated VF/port.
0 = disable, 1 = enable.
The requested enablement will occur at the next reset of the
VF (e.g. driver restart on the VM which owns the VF).
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This commit adds the infrastructure for enabling selected VFs to
operate SMI (QP0) MADs without restriction.
Additionally, for these enabled VFs, their QP0 proxy and tunnel QPs
are MLX QPs. As such, they operate over VL15. Therefore, they are
not affected by "credit" problems or changes in the VLArb table (which
may shut down VL0).
Non-enabled VFs may only create UD proxy QP0 qps (which are forced by
the hypervisor to send packets using the q-key it assigns and places
in the qp-context). Thus, non-enabled VFs will not pose a security
risk. The hypervisor discards any privileged MADs it receives from
these non-enabled VFs.
By default, all VFs are NOT enabled, and must explicitly be enabled
by the administrator.
The sysfs interface which operates the VF enablement infrastructure
is provided in the next commit.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Currently, VFs in SRIOV VFs are denied QP0 access. The main reason
for this decision is security, since Subnet Management Datagrams
(SMPs) are not restricted by network partitioning and may affect the
physical network topology. Moreover, even the SM may be denied access
from portions of the network by setting management keys unknown to the
SM.
However, it is desirable to grant SMI access to certain privileged
VFs, so that certain network management activities may be conducted
within virtual machines instead of the hypervisor.
This commit does the following:
1. Create QP0 tunnel QPs for all VFs.
2. Discard SMI mads sent-from/received-for non-privileged VFs in the
hypervisor MAD multiplex/demultiplex logic. SMI mads from/for
privileged VFs are allowed to pass.
3. MAD_IFC wrapper changes/fixes. For non-privileged VFs, only
host-view MAD_IFC commands are allowed, and only for SMI LID-Routed
GET mads. For privileged VFs, there are no restrictions.
This commit does not allow privileged VFs as yet. To determine if a VF
is privileged, it calls function mlx4_vf_smi_enabled(). This function
returns 0 unconditionally for now.
The next two commits allow defining and activating privileged VFs.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
mlx4_ib_modify_port is invoked in IB for resetting the Q_Key violations
counters and for modifying the IB port capability flags.
For example, when opensm is started up on the hypervisor,
mlx4_ib_modify_port is called to set the port's IsSM flag.
In multifunction mode, the SET_PORT command used in this flow should
be wrapped (so that the PF port capability flags are also tracked,
thus enabling the aggregate of all the VF/PF capability flags to be
tracked properly).
The procedure mlx4_SET_PORT() in main.c is also renamed to mlx4_ib_SET_PORT()
to differentiate it from procedure mlx4_SET_PORT() in port.c.
mlx4_ib_SET_PORT() is used exclusively by mlx4_ib_modify_port().
Finally, the CM invokes ib_modify_port() to set the IsCMSupported flag
even when running over RoCE. Therefore, when RoCE is active,
mlx4_ib_modify_port should return OK unconditionally (since the
capability flags and qkey violations counter are not relevant).
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Commit 297e0dad72 ("IB/mlx4: Handle Ethernet L2 parameters for IP
based GID addressing") introduced a bug where is_mcast is now no
longer initialized on the non-multicast condition and so it can be
any random value from the stack. This issue was detected by cppcheck:
[drivers/infiniband/hw/mlx4/ah.c:103]: (error) Uninitialized
variable: is_mcast
Simple fix is to initialise is_mcast to zero.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When we receive a netdev event indicating a netdev change and/or
a netdev address change, we must change the MAC index used by the
proxy QP1 (in the QP context), otherwise RoCE CM packets sent by the
VF will not carry the same source MAC address as the non-CM packets.
We use the UPDATE_QP command to perform this change.
In order to avoid modifying a QP context based on netdev event,
while the driver attempts to destroy this QP (e.g either the mlx4_ib
or ib_mad modules are unloaded), we use mutex locking in both flows.
Since the relevant mlx4 proxy GSI QP is created indirectly by the
mad module when they create their GSI QP, the mlx4 didn't need to
keep track on that QP prior to this change.
Now, when QP modifications are needed to this QP from within the
driver, we added refernece to it.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- The biggest change is core API extensions and mlx5 low-level driver
support for handling DIF/DIX-style protection information, and the
addition of PI support to the iSER initiator. Target support will be
arriving shortly through the SCSI target tree.
- A nice simplification to the "umem" memory pinning library now that
we have chained sg lists. Kudos to Yishai Hadas for realizing our
code didn't have to be so crazy.
- Another nice simplification to the sg wrappers used by qib, ipath and
ehca to handle their mapping of memory to adapter.
- The usual batch of fixes to bugs found by static checkers etc. from
intrepid people like Dan Carpenter and Yann Droneaud.
- A large batch of cxgb4, ocrdma, qib driver updates.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJTPYBnAAoJEENa44ZhAt0hGI4P/29eotGwpkANUQE6FQvxCUL2
CXJtSg52lmYvGJrPK4IhihpbtQmHJz3iXEzlOOWidTw1dJgObR6vFaRymh7+vDLs
CdzybMcXdasarqTuYeJbFzhkimpwtWWrMy/8Ik/Jj/5glGQ6cUSpdYZzVtFhYNqf
hCGE8iLi+tuekJJj1htut5D6apXM7udcdc2yLJNOdsSj/VUXt1oqG1x9xAi9R8Tq
7o8eFSStdlja0EBQ6Hli2zauCSnQkaUtr8h6EAFbcCtvBK8HqsHSc2gfq2ViFUiN
ztt167oWoQnVkR0qCPL5nVt+CRQHHROprVXvbpcTI3aW61gNIl6OrUUOXefzHXac
TNi+fdMpiEB/JQ4Z04Jzd1dGCSjYeTqPj4rO4meFjBmxRDdTgZHu7FWwejT1nYJ5
d2abVdCOT+QWlIlM7m/pjdWJII5OYM+4/jtTayGepEaR4fTUzKtPZPBLNUBDBKE+
4f92PC8LiuPkwJgb6XT96onPz1bDCOnPSEdwoKUFKPeGUcwgVOM/Wx5NU4Yf7rfg
RxQwZ7mJXbjCYFlmGGo/0QDy6UEGkIFYlJSzooP+wlK1JvZ5h2M+9QKX2FtwzR+R
I2kBxcTXWsM/h88R7MkNqbNIllmhssrJwmAE46OneZbfoBOB+JZjb4nLRTu0jEcS
zn6f16GmJ37BKn2/qYY/
=Ww6H
-----END PGP SIGNATURE-----
Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull infiniband updates from Roland Dreier:
"Main batch of InfiniBand/RDMA changes for 3.15:
- The biggest change is core API extensions and mlx5 low-level driver
support for handling DIF/DIX-style protection information, and the
addition of PI support to the iSER initiator. Target support will
be arriving shortly through the SCSI target tree.
- A nice simplification to the "umem" memory pinning library now that
we have chained sg lists. Kudos to Yishai Hadas for realizing our
code didn't have to be so crazy.
- Another nice simplification to the sg wrappers used by qib, ipath
and ehca to handle their mapping of memory to adapter.
- The usual batch of fixes to bugs found by static checkers etc.
from intrepid people like Dan Carpenter and Yann Droneaud.
- A large batch of cxgb4, ocrdma, qib driver updates"
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (102 commits)
RDMA/ocrdma: Unregister inet notifier when unloading ocrdma
RDMA/ocrdma: Fix warnings about pointer <-> integer casts
RDMA/ocrdma: Code clean-up
RDMA/ocrdma: Display FW version
RDMA/ocrdma: Query controller information
RDMA/ocrdma: Support non-embedded mailbox commands
RDMA/ocrdma: Handle CQ overrun error
RDMA/ocrdma: Display proper value for max_mw
RDMA/ocrdma: Use non-zero tag in SRQ posting
RDMA/ocrdma: Memory leak fix in ocrdma_dereg_mr()
RDMA/ocrdma: Increment abi version count
RDMA/ocrdma: Update version string
be2net: Add abi version between be2net and ocrdma
RDMA/ocrdma: ABI versioning between ocrdma and be2net
RDMA/ocrdma: Allow DPP QP creation
RDMA/ocrdma: Read ASIC_ID register to select asic_gen
RDMA/ocrdma: SQ and RQ doorbell offset clean up
RDMA/ocrdma: EQ full catastrophe avoidance
RDMA/cxgb4: Disable DSGL use by default
RDMA/cxgb4: rx_data() needs to hold the ep mutex
...
My static checker complains that the sprintf() here can overflow.
drivers/infiniband/hw/mlx4/main.c:1836 mlx4_ib_alloc_eqs()
error: format string overflow. buf_size: 32 length: 69
This seems like a valid complaint. The "dev->pdev->bus->name" string
can be 48 characters long. I just made the buffer 80 characters instead
of 69 and I changed the sprintf() to snprintf().
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The code was indented too far and also kernel style says we should have
curly braces.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The code which is dealing with SRIOV alias GUIDs in the mlx4 IB driver has some
logic which operated according to the maximal possible active functions (PF + VFs).
After the single port VFs code integration this resulted in a flow of false-positive
warnings going to the kernel log after the PF driver started the alias GUID work.
Fix it by referring to the actual number of functions.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds support for N-Port VFs, this includes:
1. Adding support in the wrapped FW command
In wrapped commands, we need to verify and convert
the slave's port into the real physical port.
Furthermore, when sending the response back to the slave,
a reverse conversion should be made.
2. Adjusting sqpn for QP1 para-virtualization
The slave assumes that sqpn is used for QP1 communication.
If the slave is assigned to a port != (first port), we need
to adjust the sqpn that will direct its QP1 packets into the
correct endpoint.
3. Adjusting gid[5] to modify the port for raw ethernet
In B0 steering, gid[5] contains the port. It needs
to be adjusted into the physical port.
4. Adjusting number of ports in the query / ports caps in the FW commands
When a slave queries the hardware, it needs to view only
the physical ports it's assigned to.
5. Adjusting the sched_qp according to the port number
The QP port is encoded in the sched_qp, thus in modify_qp we need
to encode the correct port in sched_qp.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some code in the mlx4 IB driver stack assumed MLX4_MAX_PORTS ports.
Instead, we should only loop until the number of actual ports in i
the device, which is stored in dev->caps.num_ports.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the following warning for the mlx4 driver:
$ make M=drivers/infiniband C=2 CF=-D__CHECK_ENDIAN__
drivers/infiniband/hw/mlx4/qp.c:1885:31: warning: restricted __be16 degrades to integer
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
To activate RoCE/SRIOV, need to remove the following:
1. In mlx4_ib_add, need to remove the error return preventing
initialization of a RoCE port under SRIOV.
2. In update_vport_qp_params (in resource_tracker.c) need to remove
the error return when a RoCE RC or UD qp is detected.
This error return causes the INIT-to-RTR qp transition to fail
in the wrapper function under RoCE/SRIOV.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* Handle CM_SIDR_REQ_ATTR_ID and CM_SIDR_REP_ATTR_ID
in multiplex_cm_handler and demux_cm_handler.
* Handle Service ID Resolution messages and REQ messages
separately, for their formats are different.
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since there is no connection between the MAC/VLAN and the GID
when using IP-based addressing, the proxy QP1 (running on the
slave) must pass the source-mac, destination-mac, and vlan_id
information separately from the GID. Additionally, the Host
must pass the remote source-mac and vlan_id back to the slave,
This is achieved as follows:
Outgoing MADs:
1. Source MAC: obtained from the CQ completion structure
(struct ib_wc, smac field).
2. Destination MAC: obtained from the tunnel header
3. vlan_id: obtained from the tunnel header.
Incoming MADs
1. The source (i.e., remote) MAC and vlan_id are passed in
the tunnel header to the proxy QP1.
VST mode support:
For outgoing MADs, the vlan_id obtained from the header is
discarded, and the vlan_id specified by the Hypervisor is used
instead.
For incoming MADs, the incoming vlan_id (in the wc) is discarded, and the
"invalid" vlan (0xffff) is substituted when forwarding to the slave.
Signed-off-by: Moni Shoua <monis@mellanox.co.il>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IB side of RoCE requires the MAC table index of the
MAC address used by its QPs.
To obtain the real MAC index, the IB side registers the
MAC (increasing its ref count, and also returning the
real MAC index) during the modify-qp sequence.
This protects against the ETH side deleting or modifying
that MAC table entry while the QP is active.
Note that until the modify-qp command returns success,
the MAC and VLAN information only has "candidate" status.
If the modify-qp succeeds, the "candidate" info is promoted
to the operational MAC/VLAN info for the qp. If the modify fails,
the candidate MAC/VLAN is unregistered, and the old qp info
is preserved.
The patch is a bit complex, because there are multiple qp
transitions where the primary-path information may be
modified: INIT-to-RTR, and SQD-to-SQD.
Similarly for the alternate path information.
Therefore the code must handle cases where path information
has already been entered into the QP context by previous
qp transitions.
For the MAC address, the success logic is as follows:
1. If there was no previous MAC, simply move the candidate
MAC information to the operational information, and reset
the candidate MAC info.
2. If there was a previous MAC, unregister it. Then move
the MAC information from candidate to operational, and
reset the candidate info (as in 1. above).
The MAC address failure logic is the same for all cases:
- Unregister the candidate MAC, and reset the candidate MAC info.
For Vlan registration, the logic is similar.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The GIDs are statically distributed, as follows:
PF: gets 16 GIDs
VFs: Remaining GIDS are divided evenly between VFs activated by the driver.
If the division is not even, lower-numbered VFs get an extra GID.
For an IB interface, the number of gids per guest remains as before: one gid per guest.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This requires the following modifications:
1. Fix build_mlx4_header to properly fill in the ETH fields
2. Adjust mux and demux QP1 flow to support RoCE.
This commit still assumes only one GID per slave for RoCE.
The commit enabling multiple GIDs is a subsequent commit, and
is done separately because of its complexity.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch refactors the IB core umem code and vendor drivers to use a
linear (chained) SG table instead of chunk list. With this change the
relevant code becomes clearer—no need for nested loops to build and
use umem.
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
For userspace RoCE UD QPs we need to know the GID format that the
kernel uses, e.g when working over older kernels. For that end, add a
new port capability IB_PORT_IP_BASED_GIDS and report it when query
port is issued.
Signed-off-by: Moni Shoua <monis@mellanox.co.il>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When scanning netdevices we need to check a few more conditions and
cases to build the IBoE GID table properly. For example, under
bonding we must make sure that when a port is down, the bond IP
address isn't programmed as a GID, since doing so will cause failure
with IB core flows that selects ports by GID.
Signed-off-by: Moni Shoua <monis@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The IBoE code used to reset the GID table did it for all Ethernet
ports of the device. Since the whole architecture of generating GIDs
and responding to events is port-based, this is inefficient and can
lead to wrong content in the GID table. Change the reset flow to be
per-port.
Signed-off-by: Moni Shoua <monis@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Updating the GID table under IBoE requires read/write from/to shared
data structures. These data structures are protected with the device
iboe lock. The flows that modify the GID table start from
1. Initializing the GID table
2. NETDEV events
3. INET or INET6 events
This patch makes sure that the flow of initializing the GID table is
consistent with the other two flows w.r.t on what step the lock is taken.
Signed-off-by: Moni Shoua <monis@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
On the one hand, the invocation of netdev_master_upper_dev_get()
within mlx4_ib_scan_netdevs() must be done with rtnl lock held. On
the other hand, it's wrong to call rtnl_lock() from within this
function since it's also called by our netdev notifier callback.
Therefore move the locking to mlx4_ib_add() so that both cases are
covered.
Signed-off-by: Moni Shoua <monis@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Make sure that for Ethernet ports, the port GID table index 0 is always
occupied with a default GID of the relevant IPv6 link-local adderss.
This provides better user experience for legacy applications that don't use
the RDMA CM and were working on index 0 prior to the IP addressing change.
Also, as GIDs are generated from IP addresses of the network devices that
are associated with the port, it's basically possible that the GID table
will be empty if no IP address was assigned. This doesn't comply with the
IB spec section 4.1.1 "GID usage and properties".
Signed-off-by: Moni Shoua <monis@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When the device has only Ethernet ports, don't try to allocate range
of steerable UD QPs since they aren't needed. This fixes an issue
where mlx4 VFs tried to allocate a range of UD steerable QPs, but
failed to do so.
Fixes: c1c9850112 ("IB/mlx4: Add support for steerable IB UD QPs")
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Since mlx4_ib supports IP based addressing, a dependency on INET needs
to be added, since mlx4_ib registers itself for net device events.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
IP based RoCE gids don't store Ethernet L2 parameters, MAC and VLAN.
Therefore, we need to extract them from the CQE and place them in
struct ib_wc (to be used for cases were they were taken from the gid).
Also, when modifying a QP or building address handle, instead of
parsing the dgid to get the MAC and VLAN, take them from the address
handle attributes.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Currently, the mlx4 driver set IBoE (RoCE) gids to encode related
Ethernet netdevice interface MAC address and possibly VLAN id.
Change this scheme such that gids encode interface IP addresses (both
IP4 and IPv6).
This requires learning the IP addresses which are of use by a
netdevice associated with the HCA port, formatting them to gids and
adding them to the port gid table. Furthermore, events of add and
delete address are caught to maintain the gid table accordingly.
Associated IP addresses may belong to a master of an Ethernet
netdevice on top of that port so this should be considered when
building and maintaining the gid table.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Set the return variable to an error code as done elsewhere in the function.
A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
(
if@p1 (\(ret < 0\|ret != 0\))
{ ... return ret; }
|
ret@p1 = 0
)
... when != ret = e1
when != &ret
*if(...)
{
... when != ret = e2
when forall
return ret;
}
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch add the support for Ethernet L2 attributes in the
verbs/cm/cma structures.
When dealing with L2 Ethernet, we should use smac, dmac, vlan ID and priority
in a similar manner that the IB L2 (and the L4 PKEY) attributes are used.
Thus, those attributes were added to the following structures:
* ib_ah_attr - added dmac
* ib_qp_attr - added smac and vlan_id, (sl remains vlan priority)
* ib_wc - added smac, vlan_id
* ib_sa_path_rec - added smac, dmac, vlan_id
* cm_av - added smac and vlan_id
For the path record structure, extra care was taken to avoid the new
fields when packing it into wire format, so we don't break the IB CM
and SA wire protocol.
On the active side, the CM fills. its internal structures from the
path provided by the ULP. We add there taking the ETH L2 attributes
and placing them into the CM Address Handle (struct cm_av).
On the passive side, the CM fills its internal structures from the WC
associated with the REQ message. We add there taking the ETH L2
attributes from the WC.
When the HW driver provides the required ETH L2 attributes in the WC,
they set the IB_WC_WITH_SMAC and IB_WC_WITH_VLAN flags. The IB core
code checks for the presence of these flags, and in their absence does
address resolution from the ib_init_ah_from_wc() helper function.
ib_modify_qp_is_ok is also updated to consider the link layer. Some
parameters are mandatory for Ethernet link layer, while they are
irrelevant for IB. Vendor drivers are modified to support the new
function signature.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch adds support for steerable (NETIF) QP creation. When we
create the device, we allocate a range of steerable QPs.
Afterward when a QP is created with the NETIF flag, it's allocated
from this range. Allocation is managed by bitmap allocator.
Internal steering rules for those QPs is automatically generated on
their creation.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The mlx4 device requires adding IB flow spec to rules that apply over
infiniband link layer. This patch adds a mechanism to add such a rule.
If higher levels e.g. IP/UDP/TCP flow specs are provided, the device
requires us to add an empty wild-carded IB rule. Furthermore, the device
requires the QPN to be put in the rule.
Add here specific parsing support for IB empty rules and the ability
to self-generate missing specs based on existing ones.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Up until now, flow steering wasn't supported when using IB ports.
This patch enables support for flow steering if all hardware ports
support that, for example the new MLX4_DEV_CAP_FLAG2_DMFS_IPOIB mlx4
device capability.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
- Re-enable flow steering verbs with new improved userspace ABI
- Fixes for slow connection due to GID lookup scalability
- IPoIB fixes
- Many fixes to HW drivers including mlx4, mlx5, ocrdma and qib
- Further improvements to SRP error handling
- Add new transport type for Cisco usNIC
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQIcBAABCAAGBQJSil7BAAoJEENa44ZhAt0hbtgP/A+AmUalbOX6ZKzuOFxsrtY2
r55CX9b1JBeFM/Zhn2o6y+81lpCjkckJSggESMe4izNgocGw0nW4vYGN4SBynatj
y8sR9OSn+G3ihuENrzG41MJUGEa5WbcNMy4boN+Oa+qyTlV/WjLR7Fv4WbikK7Wm
o8FNlXiiDhMoGfHHG5J0MD0EQsnxuLDk2XP+ciu4tLtTs+wBka+gFK8WnMvztle3
gTeMNna5ilvCS2fdBxteuPA3KeDnJE9AgJSMJ2a4Rh+DR8uTgWYQ6n7amjmOc546
yhAKkoBkxPE10+Yj82WOPhCFxSeWcuSwJvpgv5dTVZ1XqUUcC1V3TEcZDHmyyHQ7
uPXgS1A+erBW3OYPBjZqtKvnHObscV12fL+rId3vIhcAQIbFroci08ZwPidEYRkn
fvwlEKcrIsBIpRXEyjlFCxsiiDnfq1wC1VayMR3jrIK0P6idf1SXf/geiRp9+RGT
wKUc0j51jvEx29qc65xuhEP9FQV9pCMxyd+FEE0d0KkjMz5hsIkjmcUcBbgF0CGg
GEyDPlgRLv+vmWDGpT8XraaV/0CJOEQDIgB4WSN87/AZ4UoNt7spW2xqsLsp1toy
5e0100tpWUleTPLe/Wig5GtBdagQ2jAUK1+186CP93pFPtkwc4/7X3hyp7qPIPTz
VDvT9DEy6zjSMCLpMcdo
=xxC+
-----END PGP SIGNATURE-----
Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull infiniband/rdma updates from Roland Dreier:
- Re-enable flow steering verbs with new improved userspace ABI
- Fixes for slow connection due to GID lookup scalability
- IPoIB fixes
- Many fixes to HW drivers including mlx4, mlx5, ocrdma and qib
- Further improvements to SRP error handling
- Add new transport type for Cisco usNIC
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (66 commits)
IB/core: Re-enable create_flow/destroy_flow uverbs
IB/core: extended command: an improved infrastructure for uverbs commands
IB/core: Remove ib_uverbs_flow_spec structure from userspace
IB/core: Use a common header for uverbs flow_specs
IB/core: Make uverbs flow structure use names like verbs ones
IB/core: Rename 'flow' structs to match other uverbs structs
IB/core: clarify overflow/underflow checks on ib_create/destroy_flow
IB/ucma: Convert use of typedef ctl_table to struct ctl_table
IB/cm: Convert to using idr_alloc_cyclic()
IB/mlx5: Fix page shift in create CQ for userspace
IB/mlx4: Fix device max capabilities check
IB/mlx5: Fix list_del of empty list
IB/mlx5: Remove dead code
IB/core: Encorce MR access rights rules on kernel consumers
IB/mlx4: Fix endless loop in resize CQ
RDMA/cma: Remove unused argument and minor dead code
RDMA/ucma: Discard events for IDs not yet claimed by user space
IB/core: Add Cisco usNIC rdma node and transport types
RDMA/nes: Remove self-assignment from nes_query_qp()
IB/srp: Report receive errors correctly
...
This commit reverts commit 7afbddfae9 ("IB/core: Temporarily disable
create_flow/destroy_flow uverbs"). Since the uverbs extensions
functionality was experimental for v3.12, this patch re-enables the
support for them and flow-steering for v3.13.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Commit 400dbc9658 ("IB/core: Infrastructure for extensible uverbs
commands") added an infrastructure for extensible uverbs commands
while later commit 436f2ad05a ("IB/core: Export ib_create/destroy_flow
through uverbs") exported ib_create_flow()/ib_destroy_flow() functions
using this new infrastructure.
According to the commit 400dbc9658, the purpose of this
infrastructure is to support passing around provider (eg. hardware)
specific buffers when userspace issue commands to the kernel, so that
it would be possible to extend uverbs (eg. core) buffers independently
from the provider buffers.
But the new kernel command function prototypes were not modified to
take advantage of this extension. This issue was exposed by Roland
Dreier in a previous review[1].
So the following patch is an attempt to a revised extensible command
infrastructure.
This improved extensible command infrastructure distinguish between
core (eg. legacy)'s command/response buffers from provider
(eg. hardware)'s command/response buffers: each extended command
implementing function is given a struct ib_udata to hold core
(eg. uverbs) input and output buffers, and another struct ib_udata to
hold the hw (eg. provider) input and output buffers.
Having those buffers identified separately make it easier to increase
one buffer to support extension without having to add some code to
guess the exact size of each command/response parts: This should make
the extended functions more reliable.
Additionally, instead of relying on command identifier being greater
than IB_USER_VERBS_CMD_THRESHOLD, the proposed infrastructure rely on
unused bits in command field: on the 32 bits provided by command
field, only 6 bits are really needed to encode the identifier of
commands currently supported by the kernel. (Even using only 6 bits
leaves room for about 23 new commands).
So this patch makes use of some high order bits in command field to
store flags, leaving enough room for more command identifiers than one
will ever need (eg. 256).
The new flags are used to specify if the command should be processed
as an extended one or a legacy one. While designing the new command
format, care was taken to make usage of flags itself extensible.
Using high order bits of the commands field ensure that newer
libibverbs on older kernel will properly fail when trying to call
extended commands. On the other hand, older libibverbs on newer kernel
will never be able to issue calls to extended commands.
The extended command header includes the optional response pointer so
that output buffer length and output buffer pointer are located
together in the command, allowing proper parameters checking. This
should make implementing functions easier and safer.
Additionally the extended header ensure 64bits alignment, while making
all sizes multiple of 8 bytes, extending the maximum buffer size:
legacy extended
Maximum command buffer: 256KBytes 1024KBytes (512KBytes + 512KBytes)
Maximum response buffer: 256KBytes 1024KBytes (512KBytes + 512KBytes)
For the purpose of doing proper buffer size accounting, the headers
size are no more taken in account in "in_words".
One of the odds of the current extensible infrastructure, reading
twice the "legacy" command header, is fixed by removing the "legacy"
command header from the extended command header: they are processed as
two different parts of the command: memory is read once and
information are not duplicated: it's making clear that's an extended
command scheme and not a different command scheme.
The proposed scheme will format input (command) and output (response)
buffers this way:
- command:
legacy header +
extended header +
command data (core + hw):
+----------------------------------------+
| flags | 00 00 | command |
| in_words | out_words |
+----------------------------------------+
| response |
| response |
| provider_in_words | provider_out_words |
| padding |
+----------------------------------------+
| |
. <uverbs input> .
. (in_words * 8) .
| |
+----------------------------------------+
| |
. <provider input> .
. (provider_in_words * 8) .
| |
+----------------------------------------+
- response, if present:
+----------------------------------------+
| |
. <uverbs output space> .
. (out_words * 8) .
| |
+----------------------------------------+
| |
. <provider output space> .
. (provider_out_words * 8) .
| |
+----------------------------------------+
The overall design is to ensure that the extensible infrastructure is
itself extensible while begin more reliable with more input and bound
checking.
Note:
The unused field in the extended header would be perfect candidate to
hold the command "comp_mask" (eg. bit field used to handle
compatibility). This was suggested by Roland Dreier in a previous
review[2]. But "comp_mask" field is likely to be present in the uverb
input and/or provider input, likewise for the response, as noted by
Matan Barak[3], so it doesn't make sense to put "comp_mask" in the
header.
[1]:
http://marc.info/?i=CAL1RGDWxmM17W2o_era24A-TTDeKyoL6u3NRu_=t_dhV_ZA9MA@mail.gmail.com
[2]:
http://marc.info/?i=CAL1RGDXJtrc849M6_XNZT5xO1+ybKtLWGq6yg6LhoSsKpsmkYA@mail.gmail.com
[3]:
http://marc.info/?i=525C1149.6000701@mellanox.com
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com
[ Convert "ret ? ret : 0" to the equivalent "ret". - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Move the check on max supported CQEs after the final number of entries is
evaluated.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When calling get_sw_cqe() we need pass the consumer_index and not the
masked value. Failure to do so will cause incorrect result of
get_sw_cqe() possibly leading to endless loop.
This problem was reported and analyzed by Michael Rice from HP.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
To guarantee that all unused fields in all FW commands for both inboxes
and outboxes are zeroed out, initialize the mailbox buffer to all zeroes.
This is especially important for SRIOV comm-channel virtual commands
(such as QUERY_FUNC_CAP), where if new fields are added to support new
features, the driver can depend on older kernels passing zeroes in these
fields.
In addition to zeroing out the mailbox buffer at allocation time, all
(now unnecessary) calls to memset by the callers of
mlx4_alloc_cmd_mailbox() are removed.
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is step #1 for implementing SRIOV resource quotas for VFs.
Quotas are implemented per resource type for VFs and the PF, to prevent
any entity from simply grabbing all the resources for itself and leaving
the other entities unable to obtain such resources.
Resources which are allocated using quotas: QPs, CQs, SRQs, MPTs, MTTs, MAC,
VLAN, and Counters.
The quota system works as follows:
Each entity (VF or PF) is given a max number of a given resource (its quota),
and a guaranteed minimum number for each resource (starvation prevention).
For QPs, CQs, SRQs, MPTs and MTTs:
50% of the available quantity for the resource is divided equally among
the PF and all the active VFs (i.e., the number of VFs in the mlx4_core module
parameter "num_vfs"). This 50% represents the "guaranteed minimum" pool.
The other 50% is the "free pool", allocated on a first-come-first-serve basis.
For each VF/PF, resources are first allocated from its "guaranteed-minimum"
pool. When that pool is exhausted, the driver attempts to allocate from
the resource "free-pool".
The quota (i.e., max) for the VFs and the PF is:
The free-pool amount (50% of the real max) + the guaranteed minimum
For MACs:
Guarantee 2 MACs per VF/PF per port. As a result, since we have only
128 MACs per port, reduce the allowable number of VFs from 64 to 63.
Any remaining MACs are put into a free pool.
For VLANs:
For the PF, the per-port quota is 128 and guarantee is 64
(to allow the PF to register at least a VLAN per VF in VST mode).
For the VFs, the per-port quota is 64 and the guarantee is 0.
We assume that VGT VFs are trusted not to abuse the VLAN resource.
For Counters:
For all functions (PF and VFs), the quota is 128 and the guarantee is 0.
In this patch, we define the needed structures, which are added to the
resource-tracker struct. In addition, we do initialization
for the resource quota, and adjust the query_device response to use quotas
rather than resource maxima.
As part of the implementation, we introduce a new field in
mlx4_dev: quotas. This field holds the resource quotas used
to report maxima to the upper layers (ib_core, via query_device).
The HCA maxima of these values are passed to the VFs (via
QUERY_HCA) so that they may continue to use these in handling
QPs, CQs, SRQs and MPTs.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The create_flow/destroy_flow uverbs and the associated extensions to
the user-kernel verbs ABI are under review and are too experimental to
freeze at this point.
So userspace is not exposed to experimental features and an uinstable
ABI, temporarily disable this for v3.12 (with a Kconfig option behind
staging to reenable it if desired).
The feature will be enabled after proper cleanup for v3.13.
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://marc.info/?i=cover.1381351016.git.ydroneaud@opteya.com
Link: http://marc.info/?i=cover.1381177342.git.ydroneaud@opteya.com
[ Add a Kconfig option to reenable these verbs. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Implement ib_create_flow() and ib_destroy_flow().
Translate the verbs structures provided by the user to HW structures
and call the MLX4_QP_FLOW_STEERING_ATTACH/DETACH firmware commands.
On the ATTACH command completion, the firmware provides a 64-bit
registration ID, which is placed into struct mlx4_ib_flow that wraps
the instance of struct ib_flow which is retuned to caller. Later,
this reg ID is used for detaching that flow from the firmware.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When creating tunnel QPs for special QP tunneling, look for the
default pkey in the slave's virtual pkey table. If it is present, use
the real pkey index where the default pkey is located.
If the default pkey is not found in the pkey table, use the real pkey
index which is stored at index 0 in the slave's virtual pkey table
(this is the current behavior).
This change is required to support cloud computing, where the
paravirtualized index of the default pkey is moved to index 1 or
higher. The pkey at paravirtualized index 0 is used for the default
IPoIB interface created by the VF.
Its possible for the pkey value at paravirtualized index 0 to be
invalid (zero) at VF probe time (pkey index 0 is mapped to real pkey
index 127, which contains pkey = 0).
At some point after the VF probe, the cloud computing interface at the
hypervisor maps virtual index 0 for the VF to the pkey index
containing the pkey that IPoIB will use in its operation. However,
when the tunnel QP is created, the pkey at the slave's virtual index 0
is still mapped to the invalid pkey index, so tunnel QP creation
fails.
This commit causes the hypervisor to search for the default pkey in
the slave's pkey table -- and this pkey is present in the table (at
index > 0) at tunnel QP creation time, so that the tunnel QP creation
will succeed.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
So far, only net_device * could be passed along with netdevice notifier
event. This patch provides a possibility to pass custom structure
able to provide info that event listener needs to know.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
v2->v3: fix typo on simeth
shortened dev_getter
shortened notifier_info struct name
v1->v2: fix notifier_call parameter in call_netdevice_notifier()
Signed-off-by: David S. Miller <davem@davemloft.net>
- XRC transport fixes
- Fix DHCP on IPoIB
- mlx4 preparations for flow steering
- iSER fixes
- miscellaneous other fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABCAAGBQJRisChAAoJEENa44ZhAt0hHLoP/iX5MxtHJ3X1u5KcARX/7nci
CCH/VnD172re/KavCCg7zkbZQpS4jHCtW/CzLUCSPqBGOaj78HFTUkB3ragvUW7m
ndERwplF8DP/i0x7Kk7Wau2a4RdlH0lqwucOjqTyQnbIdxknkz6w3Jcsb9Ic2lzx
up0T0HaHtTxdVF6lXOB5QOIpUGg3l0Yu4euX2BA61WDoZj+VIYhgeWZewq0iV0D1
rLtarJ+Or7mdwu2rNcDHgrD0lhF4SCBd3rx4lbc4F68Cr8JUz0Xe7liPLNskeLhW
f3NEm3gmkYp9YI1otGsA0X/CyV6wnRk4mT8JMlOb2WNzeq2V13Z54/9ZyF5/gFD2
JgzkQB9Ibf7EmTwXWd6+0+FA40Q6dNvnRnhddRM255dvDVw7nxUr2UzYH4Re/Z9K
rNFjkvix2YUwEmoPjitWocz2kj2reDMqjtiVDmdGy1YbtnicH5GtkQsWkoPg8ON9
m5jORUdzydTD+yBJwTiFP1EuFoG3TdfoZ7zHMJwWy/u8i308xD6WPGms9MTdjh8j
7gjz2TCKr+vpuVRh/p6esCPPOTSsSeWDeowy7Sgpdf3qoqAImXsWXVrl2kXLhtyl
1VIgHU3ztm7oqwmy0gQ/zVCo4CLLdsif2zmEIDpxJPnWaSq+D9LJdyLTcfdBjhQ8
9SjUafe4msT1pIjNb7ND
=1ojD
-----END PGP SIGNATURE-----
Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull InfiniBand/RDMA changes from Roland Dreier:
- XRC transport fixes
- Fix DHCP on IPoIB
- mlx4 preparations for flow steering
- iSER fixes
- miscellaneous other fixes
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (23 commits)
IB/iser: Add support for iser CM REQ additional info
IB/iser: Return error to upper layers on EAGAIN registration failures
IB/iser: Move informational messages from error to info level
IB/iser: Add module version
mlx4_core: Expose a few helpers to fill DMFS HW strucutures
mlx4_core: Directly expose fields of DMFS HW rule control segment
mlx4_core: Change a few DMFS fields names to match firmare spec
mlx4: Match DMFS promiscuous field names to firmware spec
mlx4_core: Move DMFS HW structs to common header file
IB/mlx4: Set link type for RAW PACKET QPs in the QP context
IB/mlx4: Disable VLAN stripping for RAW PACKET QPs
mlx4_core: Reduce warning message for SRQ_LIMIT event to debug level
RDMA/iwcm: Don't touch cmid after dropping reference
IB/qib: Correct qib_verbs_register_sysfs() error handling
IB/ipath: Correct ipath_verbs_register_sysfs() error handling
RDMA/cxgb4: Fix SQ allocation when on-chip SQ is disabled
SRPT: Fix odd use of WARN_ON()
IPoIB: Fix ipoib_hard_header() return value
RDMA: Rename random32() to prandom_u32()
RDMA/cxgb3: Fix uninitialized variable
...
Use preferable function name which implies using a pseudo-random number
generator.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking updates from David Miller:
"Highlights (1721 non-merge commits, this has to be a record of some
sort):
1) Add 'random' mode to team driver, from Jiri Pirko and Eric
Dumazet.
2) Make it so that any driver that supports configuration of multiple
MAC addresses can provide the forwarding database add and del
calls by providing a default implementation and hooking that up if
the driver doesn't have an explicit set of handlers. From Vlad
Yasevich.
3) Support GSO segmentation over tunnels and other encapsulating
devices such as VXLAN, from Pravin B Shelar.
4) Support L2 GRE tunnels in the flow dissector, from Michael Dalton.
5) Implement Tail Loss Probe (TLP) detection in TCP, from Nandita
Dukkipati.
6) In the PHY layer, allow supporting wake-on-lan in situations where
the PHY registers have to be written for it to be configured.
Use it to support wake-on-lan in mv643xx_eth.
From Michael Stapelberg.
7) Significantly improve firewire IPV6 support, from YOSHIFUJI
Hideaki.
8) Allow multiple packets to be sent in a single transmission using
network coding in batman-adv, from Martin Hundebøll.
9) Add support for T5 cxgb4 chips, from Santosh Rastapur.
10) Generalize the VXLAN forwarding tables so that there is more
flexibility in configurating various aspects of the endpoints.
From David Stevens.
11) Support RSS and TSO in hardware over GRE tunnels in bxn2x driver,
from Dmitry Kravkov.
12) Zero copy support in nfnelink_queue, from Eric Dumazet and Pablo
Neira Ayuso.
13) Start adding networking selftests.
14) In situations of overload on the same AF_PACKET fanout socket, or
per-cpu packet receive queue, minimize drop by distributing the
load to other cpus/fanouts. From Willem de Bruijn and Eric
Dumazet.
15) Add support for new payload offset BPF instruction, from Daniel
Borkmann.
16) Convert several drivers over to mdoule_platform_driver(), from
Sachin Kamat.
17) Provide a minimal BPF JIT image disassembler userspace tool, from
Daniel Borkmann.
18) Rewrite F-RTO implementation in TCP to match the final
specification of it in RFC4138 and RFC5682. From Yuchung Cheng.
19) Provide netlink socket diag of netlink sockets ("Yo dawg, I hear
you like netlink, so I implemented netlink dumping of netlink
sockets.") From Andrey Vagin.
20) Remove ugly passing of rtnetlink attributes into rtnl_doit
functions, from Thomas Graf.
21) Allow userspace to be able to see if a configuration change occurs
in the middle of an address or device list dump, from Nicolas
Dichtel.
22) Support RFC3168 ECN protection for ipv6 fragments, from Hannes
Frederic Sowa.
23) Increase accuracy of packet length used by packet scheduler, from
Jason Wang.
24) Beginning set of changes to make ipv4/ipv6 fragment handling more
scalable and less susceptible to overload and locking contention,
from Jesper Dangaard Brouer.
25) Get rid of using non-type-safe NLMSG_* macros and use nlmsg_*()
instead. From Hong Zhiguo.
26) Optimize route usage in IPVS by avoiding reference counting where
possible, from Julian Anastasov.
27) Convert IPVS schedulers to RCU, also from Julian Anastasov.
28) Support cpu fanouts in xt_NFQUEUE netfilter target, from Holger
Eitzenberger.
29) Network namespace support for nf_log, ebt_log, xt_LOG, ipt_ULOG,
nfnetlink_log, and nfnetlink_queue. From Gao feng.
30) Implement RFC3168 ECN protection, from Hannes Frederic Sowa.
31) Support several new r8169 chips, from Hayes Wang.
32) Support tokenized interface identifiers in ipv6, from Daniel
Borkmann.
33) Use usbnet_link_change() helper in USB net driver, from Ming Lei.
34) Add 802.1ad vlan offload support, from Patrick McHardy.
35) Support mmap() based netlink communication, also from Patrick
McHardy.
36) Support HW timestamping in mlx4 driver, from Amir Vadai.
37) Rationalize AF_PACKET packet timestamping when transmitting, from
Willem de Bruijn and Daniel Borkmann.
38) Bring parity to what's provided by /proc/net/packet socket dumping
and the info provided by netlink socket dumping of AF_PACKET
sockets. From Nicolas Dichtel.
39) Fix peeking beyond zero sized SKBs in AF_UNIX, from Benjamin
Poirier"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
filter: fix va_list build error
af_unix: fix a fatal race with bit fields
bnx2x: Prevent memory leak when cnic is absent
bnx2x: correct reading of speed capabilities
net: sctp: attribute printl with __printf for gcc fmt checks
netlink: kconfig: move mmap i/o into netlink kconfig
netpoll: convert mutex into a semaphore
netlink: Fix skb ref counting.
net_sched: act_ipt forward compat with xtables
mlx4_en: fix a build error on 32bit arches
Revert "bnx2x: allow nvram test to run when device is down"
bridge: avoid OOPS if root port not found
drivers: net: cpsw: fix kernel warn on cpsw irq enable
sh_eth: use random MAC address if no valid one supplied
3c509.c: call SET_NETDEV_DEV for all device types (ISA/ISAPnP/EISA)
tg3: fix to append hardware time stamping flags
unix/stream: fix peeking with an offset larger than data in queue
unix/dgram: fix peeking with an offset larger than data in queue
unix/dgram: peek beyond 0-sized skbs
openvswitch: Remove unneeded ovs_netdev_get_ifindex()
...
When the link type is Ethernet, setting the link type in the QP
context will enable TCP/IP stateless offloads (checksum, LSO, RSS) for
RAW PACKET Ethernet QPs. For IB UD QPs this worked OK since the value
assumed by the firmware for IB link layer is zero.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Fix the asymmetric behavior w.r.t VLAN insertion/stripping for RAW
PACKET QPs -- we don't insert on send and need not strip on receive.
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The patch allows to enable/disable HW timestamping for incoming and/or
outgoing packets. It adds and initializes all structs and callbacks
needed by kernel TS API.
To enable/disable HW timestamping appropriate ioctl should be used.
Currently HWTSTAMP_FILTER_ALL/NONE and HWTSAMP_TX_ON/OFF only are
supported.
When enabling TS on receive flow - VLAN stripping will be disabled.
Also were made all relevant changes in RX/TX flows to consider TS request
and plant HW timestamps into relevant structures.
mlx4_ib was fixed to compile with new mlx4_cq_alloc() signature.
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use more preferable function name which implies using a pseudo-random
number generator.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: Steve Wise <swise@chelsio.com>
Cc: linux-rdma@vger.kernel.org
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
An XRC target QP may redirect to more than one XRC SRQ. This means
that for work completions associated with a XRC TGT QP, the srq field
in the QP has no usage and the real XRC SRQ need to be retrived using
the information from the XRCETH placed into the CQE, do that.
Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Commit 6a9200603d ("IB/mlx4: convert to idr_alloc()") forgot to remove
idr_pre_get() call in mlx4_ib_cm_paravirt_init(). It's unnecessary and
idr_pre_get() will soon be deprecated. Remove it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jack Morgenstein <jackm@dev.mellanox.co.il>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Cc: Roland Dreier <roland@purestorage.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MAX_IDR_MASK is another weirdness in the idr interface. As idr covers
whole positive integer range, it's defined as 0x7fffffff or INT_MAX.
Its usage in idr_find(), idr_replace() and idr_remove() is bizarre.
They basically mask off the sign bit and operate on the rest, so if
the caller, by accident, passes in a negative number, the sign bit
will be masked off and the remaining part will be used as if that was
the input, which is worse than crashing.
The constant is visible in idr.h and there are several users in the
kernel.
* drivers/i2c/i2c-core.c:i2c_add_numbered_adapter()
Basically used to test if adap->nr is a negative number which isn't
-1 and returns -EINVAL if so. idr_alloc() already has negative
@start checking (w/ WARN_ON_ONCE), so this can go away.
* drivers/infiniband/core/cm.c:cm_alloc_id()
drivers/infiniband/hw/mlx4/cm.c:id_map_alloc()
Used to wrap cyclic @start. Can be replaced with max(next, 0).
Note that this type of cyclic allocation using idr is buggy. These
are prone to spurious -ENOSPC failure after the first wraparound.
* fs/super.c:get_anon_bdev()
The ID allocated from ida is masked off before being tested whether
it's inside valid range. ida allocated ID can never be a negative
number and the masking is unnecessary.
Update idr_*() functions to fail with -EINVAL when negative @id is
specified and update other MAX_IDR_MASK users as described above.
This leaves MAX_IDR_MASK without any user, remove it and relocate
other MAX_IDR_* constants to lib/idr.c.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jean Delvare <khali@linux-fr.org>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Cc: "Marciniszyn, Mike" <mike.marciniszyn@intel.com>
Cc: Jack Morgenstein <jackm@dev.mellanox.co.il>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Wolfram Sang <wolfram@the-dreams.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert to the much saner new idr interface.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jack Morgenstein <jackm@dev.mellanox.co.il>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Cc: Roland Dreier <roland@purestorage.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Indicate memory windows support through device capabilities, kernel
verb entries and the relevant uverbs command mask entries.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
* Implement memory windows binding in mlx4_ib_post_send.
* Implement mlx4_ib_bind_mw by deferring to mlx4_ib_post_send.
* Rename MLX4_WQE_FMR_PERM_* flags to MLX4_WQE_FMR_AND_BIND_PERM_*,
indicating that they are used both for fast registration work
requests, and for memory window bind work requests.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Implement MW allocation and deallocation in mlx4_core and mlx4_ib.
Pass down the enable bind flag when registering memory regions.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Building qp.o triggers this gcc warning:
drivers/infiniband/hw/mlx4/qp.c: In function ‘mlx4_ib_post_send’:
drivers/infiniband/hw/mlx4/qp.c:1862:62: warning: ‘vlan’ may be used uninitialized in this function [-Wmaybe-uninitialized]
drivers/infiniband/hw/mlx4/qp.c:1752:6: note: ‘vlan’ was declared here
Looking at the code it is clear 'vlan' is only set and used if 'is_eth'
is non-zero. But by initializing 'vlan' to 0xffff, on
gcc (Ubuntu 4.7.2-22ubuntu1) 4.7.2
on x86-64 at least, we fix the warning, and the compiler was already
setting 'vlan' to 0 in the generated code, so there's no real downside.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
[ Get rid of unnecessary move of 'is_vlan' initialization. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Matches the way they're used, and actually lets at least x86-64 generate
better code:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-38 (-38)
function old new delta
mlx4_ib_post_send 4416 4378 -38
Signed-off-by: Roland Dreier <roland@purestorage.com>
MR deregistration fails when memory windows are bound to the MR.
Handle such failures by propagating them to the caller ULP.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Remove unused fields from the local invalidate WQE segment structure.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Delete successive tests to the same location. The code tested the result
of a previous allocation, that itself was already tested. It is changed to
test the result of the most recent allocation.
A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@s exists@
local idexpression y;
expression x,e;
@@
*if ( \(x == NULL\|IS_ERR(x)\|y != 0\) )
{ ... when forall
return ...; }
... when != \(y = e\|y += e\|y -= e\|y |= e\|y &= e\|y++\|y--\|&y\)
when != \(XT_GETPAGE(...,y)\|WMI_CMD_BUF(...)\)
*if ( \(x == NULL\|IS_ERR(x)\|y != 0\) )
{ ... when forall
return ...; }
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Roland Dreier <roland@purestorage.com>
We have to decrement "i" before calling mlx4_ib_free_demux_ctx() or we
free something that wasn't allocated. That's fine for free_pv_object()
but it would lead to a NULL dereference calling mlx4_ib_free_demux_ctx().
The null dereference is because ->tun is NULL when we check:
if (!ctx->tun[i])
Also we didn't free ->sriov.demux[0] so it was a small leak.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
lockdep warns about taking a hard-irq-unsafe lock (sriov->id_map_lock)
inside a hard-irq-safe lock (sriov->going_down_lock).
Since id_map_lock is never taken in the interrupt context, we can
simply reverse the order of taking the two spinlocks, thus avoiding
the warning and the depencency.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
ConnectX-3 devices can use either 64- or 32-byte completion queue
entries (CQEs) and event queue entries (EQEs). Using 64-byte
EQEs/CQEs performs better because each entry is aligned to a complete
cacheline. This patch queries the HCA's capabilities, and if it
supports 64-byte CQEs and EQES the driver will configure the HW to
work in 64-byte mode.
The 32-byte vs 64-byte mode is global per HCA and not per CQ or EQ.
Since this mode is global, userspace (libmlx4) must be updated to work
with the configured CQE size, and guests using SR-IOV virtual
functions need to know both EQE and CQE size.
In case one of the 64-byte CQE/EQE capabilities is activated, the
patch makes sure that older guest drivers that use the QUERY_DEV_FUNC
command (e.g as done in mlx4_core of Linux 3.3..3.6) will notice that
they need an update to be able to work with the PPF. This is done by
changing the returned pf_context_behaviour not to be zero any more. In
case none of these capabilities is activated that value remains zero
and older guest drivers can run OK.
The SRIOV related flow is as follows
1. the PPF does the detection of the new capabilities using
QUERY_DEV_CAP command.
2. the PPF activates the new capabilities using INIT_HCA.
3. the VF detects if the PPF activated the capabilities using
QUERY_HCA, and if this is the case activates them for itself too.
Note that the VF detects that it must be aware to the new PF behaviour
using QUERY_FUNC_CAP. Steps 1 and 2 apply also for native mode.
User space notification is done through a new field introduced in
struct mlx4_ib_ucontext which holds device capabilities for which user
space must take action. This changes the binary interface so the ABI
towards libmlx4 exposed through uverbs is bumped from 3 to 4 but only
when **needed** i.e. only when the driver does use 64-byte CQEs or
future device capabilities which must be in sync by user space. This
practice allows to work with unmodified libmlx4 on older devices (e.g
A0, B0) which don't support 64-byte CQEs.
In order to keep existing systems functional when they update to a
newer kernel that contains these changes in VF and userspace ABI, a
module parameter enable_64b_cqe_eqe must be set to enable 64-byte
mode; the default is currently false.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
A client re-register event invokes cleanup of all MCGs. This is
required to protect against misbehaved guests leading to corruption of
join/leave database. However, since cleaning up the MCGs is a heavy
operation, it is pushed to a work queue for further processing.
Client re-register is also propagated to ULPs (e.g IPoIB).
However, since the cleanup is performed in a workqueue, the ULP could
leave and re-join groups before the cleanup occurs. In this case,
when the cleanup takes place, it prunes the (newly-joined) MCGs and
the ULP is left without actual MCGs while believing it joined them.
Fix this by setting the flushing flag before invoking the cleanup task
and clearing it after flushing is complete.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
In the MAD paravirtualization code, one of the checks performed when
forwarding QP1 (GSI) packets from wire to slave was a P_Key check: the
P_Key received in the MAD must be present in the guest's paravirtualized
P_Key table, and at least one of the (packet P_Key, guest P_Key) must
be a full-membership P_Key.
However, if everyone involved has only limited membership in the
default P_Key, then packets sent by full-member remote hosts arrive at
the PPF but are not passed on to the VFs with the current P_Key1 check.
Fix this as follows:
1. Don't care if P_Key received over wire is full or not. If it
successfully passed HW checks on the real QP1, then simply pass it
to guest regardless of whether the guest has full or limited
membership in its P_Key table.
2. If the guest (including paravirtualized master) has both full and
limited P_Key forms in its table, preferentially pass the
paravirtualized P_Key index of the full P_Key form in the tunnel
header.
3. In the multicast join flow (mlx4/mcg.c), use the index for the
default P_Key (wherever it is located) in replies generated from
within the mcg module (previously, P_Key index 0 was used in all
cases).
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Line 110 uses UL as a compiler cast for the 0x constant, but it's not
large enough to hold a 64-bit value on a 32-bit arch.
Signed-off-by: Doug Ledford <dledford@redhat.com>
[ Use "-1" instead of "FFFFFFFFFFFFFFFFULL". - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
To avoid name conflicts:
drivers/video/riva/fbdev.c:281:9: sparse: preprocessor token MAX_LEVEL redefined
While at it, also make the other names more consistent and add
parentheses.
[akpm@linux-foundation.org: repair fallout]
[sfr@canb.auug.org.au: IB/mlx4: fix for MAX_ID_MASK to MAX_IDR_MASK name change]
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Bernd Petrovitsch <bernd@petrovitsch.priv.at>
Cc: walter harms <wharms@bfs.de>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Roland Dreier <roland@purestorage.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- mlx4 IB support for SR-IOV
- A couple of SRP initiator fixes
- Batch of nes hardware driver fixes
- Fix for long-standing use-after-free crash in IPoIB
- Other miscellaneous fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABCAAGBQJQav4jAAoJEENa44ZhAt0hmL0QAJTuMdSOzYFd/NB38owJCNM2
kz/N1GlBm3z98fIlGo8u+lzgV2qxqZSAzsJsouMeK38KiAX3CL8HKe44A1QvTM6v
dXTNL4JFX24/YF+nlmMY8Av518I9Mkte3BZCnpYkBjVFBWe0ePwoRC/btfBXPDIV
0snq4OtjoBAn00dOOyuZ5PoyY9xf0z4UB0Gple9sM4mzEb8wVWdNDDPOiuPJc6fA
L+gk6HLkZDg54+QswafdKYwpeTq45wIKLmCdS3oUNmppMLVhZY8rECOwzSa+KiTr
/Yo19n+zl+IBlvjQHhmUqGHvdD17PaGlr+TckAsQqmVfXUH5qqpEnkF8FoEK59c5
YA3lVU8Sj4BPhJ7qX54CuN3767mZizakkxCr9iPRzABFTgzWVgcSgCrE8jjx4i0h
Pam+L5bmANFStgmGR8PmXiNgcrCUcEqYHsOWDDAnHa5ekb2nyv1JL1c18hlY9hC3
Xb1YTMZFwvofGza89hBu7oHrMbLOUc5kW2lBpvUn2nlyf3i0F8ISlVbVbNjFA54p
60/jHa2VOQ2CcJUJKnJOk4ajOOEfHnPtMn2q96XJ69Dp8+eSYEO/G+0i1OlChq4h
ClnG0Yp+NkT1o8WXMd7guDR+RsXt+DXIij5TiUWRIqnIlopIsMTRhNH28tMu4jQL
fgN5n987wru91ewdX4gW
=PAcy
-----END PGP SIGNATURE-----
Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull infiniband updates from Roland Dreier:
"First batch of InfiniBand/RDMA changes for the 3.7 merge window:
- mlx4 IB support for SR-IOV
- A couple of SRP initiator fixes
- Batch of nes hardware driver fixes
- Fix for long-standing use-after-free crash in IPoIB
- Other miscellaneous fixes"
This merge also removes a new use of __cancel_delayed_work(), and
replaces it with the regular cancel_delayed_work() that is now irq-safe
thanks to the workqueue updates.
That said, I suspect the sequence in question should probably use
"mod_delayed_work()". I just did the minimal "don't use deprecated
functions" fixup, though.
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (45 commits)
IB/qib: Fix local access validation for user MRs
mlx4_core: Disable SENSE_PORT for multifunction devices
mlx4_core: Clean up enabling of SENSE_PORT for older (ConnectX-1/-2) HCAs
mlx4_core: Stash PCI ID driver_data in mlx4_priv structure
IB/srp: Avoid having aborted requests hang
IB/srp: Fix use-after-free in srp_reset_req()
IB/qib: Add a qib driver version
RDMA/nes: Fix compilation error when nes_debug is enabled
RDMA/nes: Print hardware resource type
RDMA/nes: Fix for crash when TX checksum offload is off
RDMA/nes: Cosmetic changes
RDMA/nes: Fix for incorrect MSS when TSO is on
RDMA/nes: Fix incorrect resolving of the loopback MAC address
mlx4_core: Fix crash on uninitialized priv->cmd.slave_sem
mlx4_core: Trivial cleanups to driver log messages
mlx4_core: Trivial readability fix: "0X30" -> "0x30"
IB/mlx4: Create paravirt contexts for VFs when master IB driver initializes
mlx4: Modify proxy/tunnel QP mechanism so that guests do no calculations
mlx4: Paravirtualize Node Guids for slaves
mlx4: Activate SR-IOV mode for IB
...
When we have VFs and PFs on same host, the VFs are activated within
the mlx4_core module before the mlx4_ib kernel module is loaded.
When the mlx4_ib module initializes the PF (master), it now creates
MAD paravirtualization contexts for any VFs that already active.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Previously, the structure of a guest's proxy QPs followed the
structure of the PPF special qps (qp0 port 1, qp0 port 2, qp1 port 1,
qp1 port 2, ...). The guest then did offset calculations on the
sqp_base qp number that the PPF passed to it in QUERY_FUNC_CAP().
This is now changed so that the guest does no offset calculations
regarding proxy or tunnel QPs to use. This change frees the PPF from
needing to adhere to a specific order in allocating proxy and tunnel
QPs.
Now QUERY_FUNC_CAP provides each port individually with its proxy
qp0, proxy qp1, tunnel qp0, and tunnel qp1 QP numbers, and these are
used directly where required (with no offset calculations).
To accomplish this change, several fields were added to the phys_caps
structure for use by the PPF and by non-SR-IOV mode:
base_sqpn -- in non-sriov mode, this was formerly sqp_start.
base_proxy_sqpn -- the first physical proxy qp number -- used by PPF
base_tunnel_sqpn -- the first physical tunnel qp number -- used by PPF.
The current code in the PPF still adheres to the previous layout of
sqps, proxy-sqps and tunnel-sqps. However, the PPF can change this
layout without affecting VF or (paravirtualized) PF code.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This is necessary in order to support > 1 VF/PF in a VM for software
that uses the node guid as a discriminator, such as librdmacm.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Remove the error returns for IB ports from mlx4_ib_add,
mlx4_INIT_PORT_wrapper, and mlx4_CLOSE_PORT_wrapper.
Currently, SRIOV is supported only for devices for which the
link layer is IB on all ports; RoCE support will be added later.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
1. Allow only master to change node description.
2. Prevent AH leakage in send mads.
3. Take device part number from PCI structure, so that guests see the
VF part number (and not the PF part number).
4. Place the device revision ID into caps structure at startup.
5. SET_PORT in update_gids_task needs to go through wrapper on master.
6. In mlx4_ib_event(), PORT_MGMT_EVENT needs be handled in a work
queue on the master, since it propagates events to slaves using
GEN_EQE.
7. Do not support FMR on slaves.
8. Add spinlock to slave_event(), since it is called both in interrupt
context and in process context (due to 6 above, and also if
smp_snoop is used). This fix was found and implemented by Saeed
Mahameed <saeedm@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This directory is added only for the master -- slaves do not have it.
The sysfs iov directory is used to manage and examine the port P_Key
and guid paravirtualization.
Under iov/ports, the administrator may examine the gid and P_Key tables
as they are present in the device (and as are seen in the "network
view" presented to the SM).
Under the iov/<pci slot number> directories, the admin may map the
index numbers in the physical tables (as under iov/ports) to the
paravirtualized index numbers that guests see.
For example, if the administrator, for port 1 on guest 2 maps physical
pkey index 10 to virtual index 1, then that guest, whenever it uses
its pkey index 1, will actually be using the real pkey index 10.
Based on patch from Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
P_Key change and guid change events are not of interest to all slaves,
but only to those slaves which "see" the table slots whose contents
have change.
For example, if the guid at port 1, index 5 has changed in the PPF, we
wish to propagate the gid-change event only to the function which has
that guid index mapped to its port/guid table (in this case it is
slave #5). Other functions should not get the event, since the event
does not affect them.
Similarly with P_Keys -- P_Key change events are forwarded only to
slaves which have that P_Key index mapped to their virtual P_Key table.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
For IB ports, we paravirtualize the GUID at index 0 on slaves. The
GUID at index 0 seen by a slave is the actual GUID occupying the GUID
table at the slave-id index.
The driver, by default, requests at startup time that subnet manager
populate its entire guid table with GUIDs. These guids are then mapped
(paravirtualized) to the slaves, and appear for each slave as its GUID
at index 0.
Until each slave has such a guid, its port status is DOWN.
The guid table is cached to support special QP paravirtualization, and
event propagation to slaves on guid change (we test to see if the guid
really changed before propagating an event to the slave).
To support this caching, add capability to __mlx4_ib_query_gid() to
obtain the network view (i.e., physical view) gid at index X, not just
the host (paravirtualized) view.
Based on a patch from Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
In CM para-virtualization:
1. Incoming requests are steered to the correct vHCA according to the
embedded GID.
2. Communication IDs on outgoing requests are replaced by a globally
unique ID, generated by the PPF, since there is no synchronization
of ID generation between guests (and so these IDs are not
guaranteed to be globally unique). The guest's comm ID is stored,
and is returned to the response MAD when it arrives.
Signed-off-by: Amir Vadai <amirv@mellanox.co.il>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
MCG paravirtualization support includes:
- Creating multicast groups by VFs, and keeping accounting of them
- Leaving multicast groups by VFs
- Updating SM only with real changes in the overall picture of MCGs status
- Creation of MGID=0 groups (let SM choose MGID)
Note that the MCG module maintains its own internal MCG object
reference counts. The reason for this is that the IB core is used to
track only the multicast groups joins generated by the PF it runs
over. The PF IB core layer is unaware of slaves, so it cannot be used
to keep track of MCG joins they generate.
Signed-off-by: Oren Duer <oren@mellanox.co.il>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The MAD_IFC firmware command fulfills two functions.
First, it is used in the QP0/QP1 MAD-handling flow to obtain
information from the FW (for answering queries), and for setting
variables in the HCA (MAD SET packets).
For this, MAD_IFC should provide the FW (physical) view of the data.
This is the view that OpenSM needs. We call this the "network view".
In the second case, MAD_IFC is used by various verbs to obtain data
regarding the local HCA (e.g., ib_query_device()). We call this the
"host view".
This data needs to be paravirtualized.
MAD_IFC therefore needs a wrapper function, and also needs another
flag indicating whether it should provide the network view (when it is
called by ib_process_mad in special-qp packet handling), or the host
view (when it is called while implementing a verb).
There are currently 2 flag parameters in mlx4_MAD_IFC already:
ignore_bkey and ignore_mkey. These two parameters are replaced by a
single "mad_ifc_flags" parameter, with different bits set for each
flag. A third flag is added: "network-view/host-view".
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Special QPs are paravirtualized.
vHCAs are not given direct access to QP0/1. Rather, these QPs are
operated by a special context hosted by the PF, which mediates access
to/from vHCAs. This is done by opening a "tunnel" per vHCA port per
QP0/1. A tunnel comprises a pair of UD QPs: a "Tunnel QP" in the
PF-context and a "Proxy QP" in the vHCA. All vHCA MAD traffic must
pass through the corresponding tunnel. vHCA QPs cannot be assigned to
VL15 and are denied of the well-known QKey.
Outgoing messages are "de-multiplexed" (i.e., directed to the wire via
the real special QP).
Incoming messages are "multiplexed" (i.e. steered by the PPF to the
correct VF or to the PF)
QP0 access is restricted to the PF vHCA. VF vHCAs also have (virtual)
QP0s, but they never receive any SMPs and all SMPs sent are discarded.
QP1 traffic is allowed for all vHCAs, but special care is required to
bridge the gap between the host and network views.
Specifically:
- Transaction IDs are mapped to guarantee uniqueness among vHCAs
- CM para-virtualization
o Incoming requests are steered to the correct vHCA according to the embedded GID
o Local communication IDs are mapped to ensure uniqueness among vHCAs
(see the patch that adds CM paravirtualization.)
- Multicast para-virtualization
o The PF context aggregates membership state from all vHCAs
o The SA is contacted only when the aggregate membership changes
o If the aggregate does not change, the PF context will provide the
requesting vHCA with the proper response.
(see the patch that adds multicast group paravirtualization)
Incoming MADs are steered according to:
- the DGID If a GRH is present
- the mapped transaction ID for response MADs
- the embedded GID in CM requests
- the remote communication ID in other CM messages
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This requires:
1. Replacing the paravirtualized P_Key index (inserted by the guest)
with the real P_Key index.
2. For UD QPs, placing the guest's true source GID index in the
address path structure mgid field, and setting the ud_force_mgid
bit so that the mgid is taken from the QP context and not from the
WQE when posting sends.
3. For UC and RC QPs, placing the guest's true source GID index in the
address path structure mgid field.
4. For tunnel and proxy QPs, setting the Q_Key value reserved for that
proxy/tunnel pair.
Since not all the above adjustments occur in all the QP transitions,
the QP transitions require separate wrapper functions.
Secondly, initialize the P_Key virtualization table to its default
values: Master virtualized table is 1-1 with the real P_Key table,
guest virtualized table has P_Key index 0 mapped to the real P_Key
index 0, and all the other P_Key indices mapped to the reserved
(invalid) P_Key at index 127.
Finally, add logic in smp_snoop for maintaining the phys_P_Key_cache.
and generating events on the master only if a P_Key actually changed.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Allocate SR-IOV paravirtualization resources and MAD demuxing contexts
on the master.
This has two parts. The first part is to initialize the structures to
contain the contexts. This is done at master startup time in
mlx4_ib_init_sriov().
The second part is to actually create the tunneling resources required
on the master to support a slave. This is performed the master
detects that a slave has started up (MLX4_DEV_EVENT_SLAVE_INIT event
generated when a slave initializes its comm channel).
For the master, there is no such startup event, so it creates its own
tunneling resources when it starts up. In addition, the master also
creates the real special QPs. The ib_core layer on the master causes
creation of proxy special QPs, since the master is also
paravirtualized at the ib_core layer.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
1. Introduce the basic SR-IOV parvirtualization context objects for
multiplexing and demultiplexing MADs.
2. Introduce support for the new proxy and tunnel QP types.
This patch introduces the objects required by the master for managing
QP paravirtualization for guests.
struct mlx4_ib_sriov is created by the master only.
It is a container for the following:
1. All the info required by the PPF to multiplex and de-multiplex MADs
(including those from the PF). (struct mlx4_ib_demux_ctx demux)
2. All the info required to manage alias GUIDs (i.e., the GUID at
index 0 that each guest perceives. In fact, this is not the GUID
which is actually at index 0, but is, in fact, the GUID which is at
index[<VF number>] in the physical table.
3. structures which are used to manage CM paravirtualization
4. structures for managing the real special QPs when running in SR-IOV
mode. The real SQPs are controlled by the PPF in this case. All
SQPs created and controlled by the ib core layer are proxy SQP.
struct mlx4_ib_demux_ctx contains the information per port needed
to manage paravirtualization:
1. All multicast paravirt info
2. All tunnel-qp paravirt info for the port.
3. GUID-table and GUID-prefix for the port
4. work queues.
struct mlx4_ib_demux_pv_ctx contains all the info for managing the
paravirtualized QPs for one slave/port.
struct mlx4_ib_demux_pv_qp contains the info need to run an individual
QP (either tunnel qp or real SQP).
Note: We made use of the 2 most significant bits in enum
mlx4_ib_qp_flags (based on enum ib_qp_create_flags in ib_verbs.h).
We need these bits in the low-level driver for internal purposes.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Unlike other parts of the mlx4_ib code, the function build_mlx_header()
doesn't check if the iboe netdev of the given port is valid before
dereferencing it, which can cause a crash if the ethernet interface
has already been taken down.
Fix this by checking for a valid netdev pointer before using it to get
the port MAC address.
Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
- Updates to the qib low-level driver
- First chunk of changes for SR-IOV support for mlx4 IB
- RDMA CM support for IPv6-only binding
- Other misc cleanups and fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABCAAGBQJQDXhUAAoJEENa44ZhAt0hZxwQAJydS9f9me31AGa45SAq8rA8
6e3LgnQ6jS38he7LZZdSsT7g25jROCKOcTj6VUkNSVAVSkK8zcp7wngjDxw50IK6
FF9hNeljMkWBflOhxzB34DRGCW4b4J+Yt1o7v1RtawiG/Mhri/6imKL0Aqjt4GwX
a1MPZn+xI2osLujfdHJtATPWWB9jCaXdFe4DJUNPdqJhS6TN7s8OP3XMiqoJtdaV
ptHeSSbjdUR1mg/h2LU2FVmWHXNSBxn7MEsrBBRkQVyiEkXieFBLwPTp4DqmAUJf
xugm6Hf4sbiQ+QuU0baJODt56wveuYVQ4HKUzE0urUFXyU4TUB9blrehWZnKsRte
wo/w4nvVlnXgGhHH0Igq76RDX8aCwc/6uQJ/29oChWjrei3HE0LjmIlPAu0vAhyw
ViLe02/r2gKXQv1NxIqhPmGsJTZizg2mUk2eEJHHPQb/NGL7iNo6b+141AfURoqQ
goGmlGdzffCRpmo6FXFZ57RPVRS4gwMunCY/Pmvq5a2t4oZh8899l2+V3N7bCfvH
+JdavxAjia9U4IlgPsAqVaz8z8TyHOY/lEd75Wnw8q9www3kLx7hASvHsdwrS4LL
ihzECzsaXOSoIXQzghs+6iyp1pzmzE9ve8OqbIGJStzlrOLyn7Gjo+Ixwm0SLsTO
I7h9iJ1OMKFZ8bzmHsWb
=f09j
-----END PGP SIGNATURE-----
Merge tag 'rdma-for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull InfiniBand/RDMA changes from Roland Dreier:
- Updates to the qib low-level driver
- First chunk of changes for SR-IOV support for mlx4 IB
- RDMA CM support for IPv6-only binding
- Other misc cleanups and fixes
Fix up some add-add conflicts in include/linux/mlx4/device.h and
drivers/net/ethernet/mellanox/mlx4/main.c
* tag 'rdma-for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (30 commits)
IB/qib: checkpatch fixes
IB/qib: Add congestion control agent implementation
IB/qib: Reduce sdma_lock contention
IB/qib: Fix an incorrect log message
IB/qib: Fix QP RCU sparse warnings
mlx4: Put physical GID and P_Key table sizes in mlx4_phys_caps struct and paravirtualize them
mlx4_core: Allow guests to have IB ports
mlx4_core: Implement mechanism for reserved Q_Keys
net/mlx4_core: Free ICM table in case of error
IB/cm: Destroy idr as part of the module init error flow
mlx4_core: Remove double function declarations
IB/mlx4: Fill the masked_atomic_cap attribute in query device
IB/mthca: Fill in sq_sig_type in query QP
IB/mthca: Warning about event for non-existent QPs should show event type
IB/qib: Fix sparse RCU warnings in qib_keys.c
net/mlx4_core: Initialize IB port capabilities for all slaves
mlx4: Use port management change event instead of smp_snoop
IB/qib: RCU locking for MR validation
IB/qib: Avoid returning EBUSY from MR deregister
IB/qib: Fix UC MR refs for immediate operations
...
Enable callers of mlx4_assign_eq to supply a pointer to cpu_rmap.
If supplied, the assigned IRQ is tracked using rmap infrastructure.
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To allow easy paravirtualization of P_Key and GID table sizes, keep
paravirtualized sizes in mlx4_dev->caps, but save the actual physical
sizes from FW in struct: mlx4_dev->phys_cap.
In addition, in SR-IOV mode, do the following:
1. Reduce reported P_Key table size by 1.
This is done to reserve the highest P_Key index for internal use,
for declaring an invalid P_Key in P_Key paravirtualization.
We require a P_Key index which always contain an invalid P_Key
value for this purpose (i.e., one which cannot be modified by
the subnet manager). The way to do this is to reduce the
P_Key table size reported to the subnet manager by 1, so that
it will not attempt to access the P_Key at index #127.
2. Paravirtualize the GID table size to 1. Thus, each guest sees
only a single GID (at its paravirtualized index 0).
In addition, since we are paravirtualizing the GID table size to 1, we
add paravirtualization of the master GID event here (i.e., we do not
do ib_dispatch_event() for the GUID change event on the master, since
its (only) GUID never changes).
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When the user queries for device capabilities, fill in the
masked_atomic_cap attribute with the real support level of atomic
capabilities instead of using a hard coded value.
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Reviewed-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The port management change event can replace smp_snoop. If the
capability bit for this event is set in dev-caps, the event is used
(by the driver setting the PORT_MNG_CHG_EVENT bit in the async event
mask in the MAP_EQ fw command). In this case, when the driver passes
incoming SMP PORT_INFO SET mads to the FW, the FW generates port
management change events to signal any changes to the driver.
If the FW generates these events, smp_snoop shouldn't be invoked in
ib_process_mad(), or duplicate events will occur (once from the
FW-generated event, and once from smp_snoop).
In the case where the FW does not generate port management change
events smp_snoop needs to be invoked to create these events. The flow
in smp_snoop has been modified to make use of the same procedures as
in the fw-generated-event event case to generate the port management
events (LID change, Client-rereg, Pkey change, and/or GID change).
Port management change event handling required changing the
mlx4_ib_event and mlx4_dispatch_event prototypes; the "param" argument
(last argument) had to be changed to unsigned long in order to
accomodate passing the EQE pointer.
We also needed to move the definition of struct mlx4_eqe from
net/mlx4.h to file device.h -- to make it available to the IB driver,
to handle port management change events.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Define pr_fmt and add some pr_debug prints.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The driver is modified to support three operation modes.
If supported by firmware use the device managed flow steering
API, that which we call device managed steering mode. Else, if
the firmware supports the B0 steering mode use it, and finally,
if none of the above, use the A0 steering mode.
When the steering mode is device managed, the code is modified
such that L2 based rules set by the mlx4_en driver for Ethernet
unicast and multicast, and the IB stack multicast attach calls
done through the mlx4_ib driver are all routed to use the device
managed API.
When attaching rule using device managed flow steering API,
the firmware returns a 64 bit registration id, which is to be
provided during detach.
Currently the firmware is always programmed during HCA initialization
to use standard L2 hashing. Future work should be done to allow
configuring the flow-steering hash function with common, non
proprietary means.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1. Limit the max number of WQEs per QP reported when querying the
device, so that ib_create_qp() will not fail for a QP size that the
device claimed to support due to additional headroom WQEs being
allocated.
2. Limit qp resources accepted for ib_create_qp() to the limits
reported in ib_query_device(). In kernel space, make sure that the
limits returned to the caller following qp creation also lie within
the reported device limits. For userspace, report as before, and do
adjustment in libmlx4 (so as not to break ABI).
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Sagi Grimberg <sagig@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Commit e605b743f3 ("IB/mlx4: Increase the number of vectors (EQs)
available for ULPs") didn't handle correctly the case where there
aren't enough MSI-X vectors to increase the number of EQs, so only the
legacy EQs are allocated. This results in an attempt to memset() to
zero the EQ table which was never allocated and a kernel crash.
Fix this by checking in the teardown flow if the table of EQs was ever
allocated. Also remove some unneeded setting to zero of the EQ
related fields in struct mlx4_ib_dev.
Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
We need to use a different loop index for mlx4_counter_alloc() and for
device_create_file() iterations: the mlx4_counter_alloc() loop index
is used in the error flow to free counters.
If the same loop index is used for device_create_file() and, say, the
device_create_file() loop fails on the first iteration, the allocated
counters will not be freed.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Enable IB ULPs to use a larger portion of the device EQs (which map to
IRQs). The mlx4_ib driver follows the mlx4_core framework of the EQs
to be divided among the device ports. In this scheme, for each IB
port, the number of allocated EQs follows the number of cores, subject
to other system constraints, such as number available MSI-X vectors.
Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
[ Replace one more printk_once() with pr_info_once(). - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Otherwise CM packets going over MLX QP1 get fixed scheduling priority 0.
We want CM packets to get the same scheduling priority, and therefore
map to the same SQ (Schedule Queue) and eventually TC (Traffic Class),
as the application requested for the actual QP used for the connection.
Signed-off-by: Oren Duer <oren@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Implement raw packet QPs for Ethernet ports using the MLX transport (as
done by the mlx4_en Ethernet netdevice driver).
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
- fix memory leak in mlx4
- fix two problems with new MAD response generation code
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABCAAGBQJPmYfMAAoJEENa44ZhAt0h06IP/jPJeH855wC2RgWXY2Fd+b6w
K/DQVE2MaX2f8EHuAB5MKoC0zzchKNu0GTMbrZMizbocDBmIU6ibF3MKx2rycSBC
FI906X7FFAeSyaywrI+k+ZGxVQYAck1QFDr44WiyJ400r8BGbx9ZECKvIyEwjaUa
RpRn4Pui4IjMV7GFqiziUsa9uo4xKYagjNfeInAZXelzIN76Coeae28dAIE2rPT/
BTcrtQ2JMG6H4AVGTImTP+Kqk1Cm7apBeHS4+SGWWTdw1WH4IcMqPmxHBUUtKTWt
UnlgpHquuC913VpkbkGeog32iAndhHgAkfs67Rs0IFwck127Il/DiPK/zir2zkZ2
YVilzYIxrSUYqPZYfKFcrKsnjfM+HJxBEPCfDjxfG0TjrVS8tnc75Uq1NvIt6e42
lcIclLDyZtdsMLmJlsTsT5cF7HlThSjpZDSgmMi2eeOI0RKoqRyyGGwiAIq7Ud2a
oRa+bheh4pBD2lvzd62uEkgMkTaToAz3I/vlbOTSwxYg2skoHUpwNPCsmstsQ7RU
BHi7Nl5DLNT5lNBTIouWvbilpXs8eXwgcNrQcWSYaPeY2aDsGAAdi9Q4BadENao9
dmxOtqjryctX7U7LGDRmNDIfiyPY5AodDUJjFux1JwqbdSCGf1Lz7uvDWu7gqL9V
TCJFAK54okexpoyV6Vgv
=iRrT
-----END PGP SIGNATURE-----
Merge tag 'ib-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull infiniband fixes from Roland Dreier:
"A few fixes for regressions introduced in 3.4-rc1:
- fix memory leak in mlx4
- fix two problems with new MAD response generation code"
* tag 'ib-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IB/mlx4: Fix memory leaks in ib_link_query_port()
IB/mad: Don't send response for failed MADs
IB/mad: Set 'D' bit in response for unhandled MADs
If the call to mlx4_MAD_IFC() fails in ib_link_query_port() we will
currently do 'return err;' which will leak 'in_mad' and 'out_mad'. We
should instead do 'goto out;' where we'll properly free the memory we
previously allocated.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Acked-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When the IB port is down, the active_speed value returned by the
MAD_IFC command is seven (7) which isn't among the defined IB speeds
in enum ib_port_speed, and this invalid speed value is passed up to
higher layers or applications who do port query.
Fix that by setting the speed to be SDR -- the lowest possible -- when
the port is down.
Signed-off-by: Roland Dreier <roland@purestorage.com>
To issue a port query, use the QUERY_(Ethernet)_PORT command instead
of the MAD_IFC command, since MAD_IFC attempts to query the firmware
IB SMA, which is irrelevant for IBoE ports.
This allows us to handle both 10Gb/s and 40Gb/s rates (e.g in sysfs),
using QDR speed (10Gb/s) and width of 1X or 4X.
Signed-off-by: Dotan Barak <dotanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
If an erroneous CQE is polled in the first iteration (i.e. npolled ==
0), we don't update the consumer index and hence the hardware could
get a wrong notion of how many CQEs software polled. Fix this by
unconditionally updating the doorbell record. We could change the
check to be something like
if (npolled || err != -EAGAIN)
...
but it does not seem worth the effort since a posted write to memory
should not cost too much.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Use a bit in wc_flags rather then a whole integer to hold the
"checksum OK" flag. By itself, this change doesn't reduce the size of
struct ib_wc on 64bit machines -- it stays on 56 bytes because of
padding. However, it will allow to add more fields in the future
without enlarging the struct. Also, it will let us have a unified
approach with future libibverbs checksum offload reporting, because a
bit flag doesn't break the library ABI.
This patch was suggested during conversation with Liran Liss
<liranl@mellanox.com>.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
While doing the work for commit a6f7feae6d ("IB/mlx4: pass SMP
vendor-specific attribute MADs to firmware") we realized that the
firmware would respond on all sorts of vendor-specific MADs.
Therefore commit 97285b7817 ("mlx4_core: Add extended port
capabilities support") adds redundant code into the driver, since
there's no real reaon to maintain the extended capabilities of the
port, as they can be queried on demand (e.g the FDR10 capability).
This patch reverts commit 97285b7817 and removes the check for
extended caps from the mlx4_ib driver port query flow.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The kernel IB stack uses one enumeration for IB speed, which wasn't
explicitly specified in the verbs header file. Add that enum, and use
it all over the code.
The IB speed/width notation is also used by iWARP and IBoE HW drivers,
which use the convention of rate = speed * width to advertise their
port link rate.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
ConnectX devices have a limit on the number of mappings that can be
done on an FMR before having to call sync_tpt. The current
mlx4_ib driver reports the limit correctly in max_map_per_fmr in
.query_device(), but mlx4_core doesn't check it when actually
allocating FMRs.
Add a max_fmr_maps field to struct mlx4_caps and enforce this maximum
value on FMR allocations.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
If the opcode of a work request exceeds the range of valid opcodes,
return the pointer to the offending work request.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
In the current code, vendor-specific MADs (e.g with the FDR-10
attribute) are silently dropped by the driver, resulting in timeouts
at the sending side and inability to query/configure the relevant
feature. However, the ConnectX firmware is able to handle such MADs.
For unsupported attributes, the firmware returns a GET_RESPONSE MAD
containing an error status.
For example, for a FDR-10 node with LID 11:
# ibstat mlx4_0 1
CA: 'mlx4_0'
Port 1:
State: Active
Physical state: LinkUp
Rate: 40 (FDR10)
Base lid: 11
LMC: 0
SM lid: 24
Capability mask: 0x02514868
Port GUID: 0x0002c903002e65d1
Link layer: InfiniBand
Extended Port Query (EPI) vendor mad timeouts before the patch:
# smpquery MEPI 11 -d
ibwarn: [4196] smp_query_via: attr 0xff90 mod 0x0 route Lid 11
ibwarn: [4196] _do_madrpc: retry 1 (timeout 1000 ms)
ibwarn: [4196] _do_madrpc: retry 2 (timeout 1000 ms)
ibwarn: [4196] _do_madrpc: timeout after 3 retries, 3000 ms
ibwarn: [4196] mad_rpc: _do_madrpc failed; dport (Lid 11)
smpquery: iberror: [pid 4196] main: failed: operation EPI: ext port info query failed
EPI query works OK with the patch:
# smpquery MEPI 11 -d
ibwarn: [6548] smp_query_via: attr 0xff90 mod 0x0 route Lid 11
ibwarn: [6548] mad_rpc: data offs 64 sz 64
mad data
0000 0000 0000 0001 0000 0001 0000 0001
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
# Ext Port info: Lid 11 port 0
StateChangeEnable:...............0x00
LinkSpeedSupported:..............0x01
LinkSpeedEnabled:................0x01
LinkSpeedActive:.................0x01
Signed-off-by: Jack Morgenstein <jackm@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Acked-by: Ira Weiny <weiny2@llnl.gov>
Cc: <stable@vger.kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
For IBoE, SLs 0-7 are mapped to Ethernet 802.1Q user priority bits
(pbits) which are part of the VLAN tag, SLs 8-15 are reserved.
Under Ethernet, the ConnectX firmware treats (decode/encode) the four
bit SL field in various constructs such as QPC / UD WQE / CQE as PPP0
and not as 0PPP. This correlates well to the fact that within the
vlan tag the pbits are located in bits 15-13 and not 12-14.
The current code wasn't consistent around that area - the
encoding was correct for the IBoE QPC.path.schedule_queue field,
but was wrong for IBoE CQEs and when MLX header was built.
These inconsistencies resulted in wrong SL <--> wire 802.1Q pbits
mapping, which is fixed by using SL <--> PPP0 all around the place.
Signed-off-by: Oren Duer <oren@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Conflicts:
net/bluetooth/l2cap_core.c
Just two overlapping changes, one added an initialization of
a local variable, and another change added a new local variable.
Signed-off-by: David S. Miller <davem@davemloft.net>
For SRIOV, some Hypervisor commands can be executed directly (native = 1).
Others should go through the command wrapper flow (for tracking resource
usage, for example, or for changing some HCA configurations that slaves
need to be notified of).
This patch sets the groundwork for this capability -- adding the correct
value of "native" in each case.
Note that if SRIOV is not activated, this parameter has no effect.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
Port mask now has additional state.
Port can be set as "none". In this case neither the mlx4_en or mlx4_ib
drivers take ownership of the port.
In multifunction mode there is an option to set the vfs as single ported devices.
(in single function mode, both physical ports belong to same function)
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit cfcde11c3d ("IB/mlx4: Use flow counters on IBoE ports") added
code that sets elements of counters[] to -1 if no counter is allocated,
but then goes ahead and passes every entry to mlx4_counter_free() on
shutdown. This is a bad idea, especially if MLX4_DEV_CAP_FLAG_COUNTERS
isn't set so there isn't even an underlying bitmap to free from.
Tested-by: Sean Hefty <sean.hefty@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (62 commits)
mlx4_core: Deprecate log_num_vlan module param
IB/mlx4: Don't set VLAN in IBoE WQEs' control segment
IB/mlx4: Enable 4K mtu for IBoE
RDMA/cxgb4: Mark QP in error before disabling the queue in firmware
RDMA/cxgb4: Serialize calls to CQ's comp_handler
RDMA/cxgb3: Serialize calls to CQ's comp_handler
IB/qib: Fix issue with link states and QSFP cables
IB/mlx4: Configure extended active speeds
mlx4_core: Add extended port capabilities support
IB/qib: Hold links until tuning data is available
IB/qib: Clean up checkpatch issue
IB/qib: Remove s_lock around header validation
IB/qib: Precompute timeout jiffies to optimize latency
IB/qib: Use RCU for qpn lookup
IB/qib: Eliminate divide/mod in converting idx to egr buf pointer
IB/qib: Decode path MTU optimization
IB/qib: Optimize RC/UC code by IB operation
IPoIB: Use the right function to do DMA unmap pages
RDMA/cxgb4: Use correct QID in insert_recv_cqe()
RDMA/cxgb4: Make sure flush CQ entries are collected on connection close
...
There's no need to set the vlan-related fields in an IBoE send WQE
control segment:
- the vlan to be used by a UD QP is set in the datagram segment.
- for GSI (CM) QP, all the headers down to 8021q and MAC are built by
the software anyway.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The IBoE port MTU is derived from the corresponding Ethernet netdevice
MTU, which can support jumbo frames of 9K, and hence surely supports
the max IB mtu of 4K.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Set the extended active speeds based on the hardware configuration.
Signed-off-by: Marcel Apfelbaum <marcela@dev.mellanox.co.il>
Reviewed-by: Hal Rosenstock <hal@mellanox.com>
[ Move FDR-10 handling into ib_link_query_port(). - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Allow processes that share the same XRC domain to open an existing
shareable QP. This permits those processes to receive events on the
shared QP and transfer ownership, so that any process may modify the
QP. The latter allows the creating process to exit, while a remaining
process can still transition it for path migration purposes.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Support the creation of XRC INI and TGT QPs. To handle the case where
a CQ or PD is not provided, we allocate them internally with the xrcd.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Allow the user to create XRC SRQs. This patch is based on a patch
from Jack Morgenstrein <jackm@dev.mellanox.co.il>.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Support creating and destroying XRC domains. Any sharing of the XRCD
is managed above the low-level driver.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Currently, there is only a single ("basic") type of SRQ, but with XRC
support we will add a second. Prepare for this by defining an SRQ type
and setting all current users to IB_SRQT_BASIC.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Use the per port counter attached to all QPs created on that port to
implement port level packets/bytes performance counters a la IB.
Derived from a patch by Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Allocate flow counter per Ethernet/IBoE port, and attach this counter
to all the QPs created on that port. Based on patch by Eli Cohen
<eli@mellanox.co.il>.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
IBoE doesn't use LIDs. Use the GID change event to update the IB core
cache for addition/deletion of GIDs.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The same packet steering mechanism would be used both for IB and Ethernet,
Both multicasts and unicasts.
This commit prepares the general infrastructure for this.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
HW revision is derived from device ID and rev id.
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.co.il>
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
The newest device firmware stores IB vs. Ethernet protocol in two bits
in members_count field of multicast group table (0: Infiniband, 1:
Ethernet). When changing the QP members count for a multicast group,
it important not to reset this information. When calling multicast
attach first time, the protocol type should be specified. In this
patch we always set it IB, but in the future we will handle Ethernet
too. When looking for a QP, the protocol type shoud be checked too.
Signed-off-by: Aleksey Senin <alekseys@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Some systems have PCI addresses that don't fit in unsigned long (eg some
32-bit PowerPC 440 systems have 36-bit bus addresses). Fix up mlx4 drivers
by using phys_addr_t where appropriate, so we don't truncate any PCI
resource addresses before ioremapping them.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (42 commits)
IB/qib: Fix refcount leak in lkey/rkey validation
IB/qib: Improve SERDES tunning on QMH boards
IB/qib: Unnecessary delayed completions on RC connection
IB/qib: Issue pre-emptive NAKs on eager buffer overflow
IB/qib: RDMA lkey/rkey validation is inefficient for large MRs
IB/qib: Change QPN increment
IB/qib: Add fix missing from earlier patch
IB/qib: Change receive queue/QPN selection
IB/qib: Fix interrupt mitigation
IB/qib: Avoid duplicate writes to the rcv head register
IB/qib: Add a few new SERDES tunings
IB/qib: Reset packet list after freeing
IB/qib: New SERDES init routine and improvements to SI quality
IB/qib: Clear WAIT_SEND flags when setting QP to error state
IB/qib: Fix context allocation with multiple HCAs
IB/qib: Fix multi-Florida HCA host panic on reboot
IB/qib: Handle transitions from ACTIVE_DEFERRED to ACTIVE better
IB/qib: UD send with immediate receive completion has wrong size
IB/qib: Set port physical state even if other fields are invalid
IB/qib: Generate completion callback on errors
...
ib_create_send_mad() can return ERR_PTR(-ENOMEM) here.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
mlx4_ib_free_cq_buf() should not be called under spin_lock_irq() since
it calls dma_free_coherent(), which needs irqs enabled. Fix this by
deferring the free to outside the locked region.
This was found due to the
WARN_ON(irqs_disabled());
in swiotlb_free_coherent().
Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Use netif_running() and netif_carrier_ok() to report link state,
exactly as is done to report Ethernet link state in sysfs.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The link rate is the product of the link speed in the link width. For
Etherent ports the rate is 10G, so we use 1 for the width and 4 for
speed to get the correct rate.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We must fully update the control segment before marking it as valid,
so that hardware doesn't start executing it before we're ready.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
[ Move VLAN control bit setting to before wmb(). - Roland ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
dev_base_lock is the legacy way to lock the device list, and is planned
to disappear. (writers hold RTNL, readers hold RCU lock)
Convert rdma_translate_ip() and update_ipv6_gids() to RCU locking.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (63 commits)
IB/qib: clean up properly if pci_set_consistent_dma_mask() fails
IB/qib: Allow driver to load if PCIe AER fails
IB/qib: Fix uninitialized pointer if CONFIG_PCI_MSI not set
IB/qib: Fix extra log level in qib_early_err()
RDMA/cxgb4: Remove unnecessary KERN_<level> use
RDMA/cxgb3: Remove unnecessary KERN_<level> use
IB/core: Add link layer type information to sysfs
IB/mlx4: Add VLAN support for IBoE
IB/core: Add VLAN support for IBoE
IB/mlx4: Add support for IBoE
mlx4_en: Change multicast promiscuous mode to support IBoE
mlx4_core: Update data structures and constants for IBoE
mlx4_core: Allow protocol drivers to find corresponding interfaces
IB/uverbs: Return link layer type to userspace for query port operation
IB/srp: Sync buffer before posting send
IB/srp: Use list_first_entry()
IB/srp: Reduce number of BUSY conditions
IB/srp: Eliminate two forward declarations
IB/mlx4: Signal node desc changes to SM by using FW to generate trap 144
IB: Replace EXTRA_CFLAGS with ccflags-y
...
This patch allows IBoE traffic to be encapsulated in 802.1Q tagged
VLAN frames. The VLAN tag is encoded in the GID and derived from it
by a simple computation.
The netdev notifier callback is modified to catch VLAN device
addition/removal and the port's GID table is updated to reflect the
change, so that for each netdevice there is an entry in the GID table.
When the port's GID table is exhausted, GID entries will not be added.
Only children of the main interfaces can add to the GID table; if a
VLAN interface is added on another VLAN interface (e.g. "vconfig add
eth2.6 8"), then that interfaces will not add an entry to the GID
table.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add 802.1q VLAN support to IBoE. The VLAN tag is encoded within the
GID derived from a link local address in the following way:
GID[11] GID[12] contain the VLAN ID when the GID contains a VLAN.
The 3 bits user priority field of the packets are identical to the 3
bits of the SL.
In case of rdma_cm apps, the TOS field is used to generate the SL
field by doing a shift right of 5 bits effectively taking to 3 MS bits
of the TOS field.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for IBoE to mlx4_ib. The bulk of the code is handling the
new address vector fields; mlx4 needs the MAC address of a remote node
to include it in a WQE (for datagrams) or in the QP context (for
connected QPs). Address resolution is done by assuming all unicast
GIDs are either link-local IPv6 addresses.
Multicast group attach/detach needs to update the NIC's multicast
filters; but since attaching a QP to a multicast group can be done
before the QP is bound to a port, for IBoE we need to keep track of
all multicast groups that a QP is attached too before it transitions
from INIT to RTR (since it does not have a port in the INIT state).
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
[ Many things cleaned up and otherwise monkeyed with; hope I didn't
introduce too many bugs. - Roland ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The Node Description cannot be changed via MADs (it is read-only).
Until now, it was changed in the driver via sysfs, and the new Node
Description was simply inserted by the driver into MAD responses
(replacing the description returned by FW).
System startup scripts use the sysfs interface to change the node
description at driver startup to show the hostname, etc. However, this
has a race condition: the SM could discover the original FW node
description rather than the system-specific description if it queried the
port before the startup scripts finish running.
For mlx4, we fix this with a new FW command (SET_NODE) that allows
passing the new node description to FW. When this command is invoked,
FW sends a trap 144 to the SM. When it gets this trap, the SM can
query the node to obtain the new node description -- thus eliminating
the effects of the race.
This patch simply calls SET_NODE command when a new node description
is entered via sysfs (thus causing trap 144 to be issued by the FW).
We ignore all failures of the SET_NODE command (including those caused
by using a device FW that predates the SET_NODE command), since in
that case things work just as before.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for packing IBoE packet headers.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
[ Clean up and fix ib_ud_header_init() a bit. - Roland ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix the limit on the size of max fast registration WRs that can be
posted to match hardware capabilities.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a new parameter to ib_register_device() so that low-level device
drivers can pass in a pointer to a callback function that will be
called for each port that is registered in sysfs. This allows
low-level device drivers to create files in
/sys/class/infiniband/<hca>/ports/<N>/
without having to poke through the internals of the RDMA sysfs handling.
There is no need for an unregister function since the kobject
reference will go to zero when ib_unregister_device() is called.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for masked atomic operations (masked compare and swap,
masked fetch and add).
Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IB/mlx4: Check correct variable for allocation failure
RDMA/nes: Correct cap.max_inline_data assignment in nes_query_qp()
RDMA/cm: Set num_paths when manually assigning path records
IB/cm: Fix device_create() return value check
The intent here is to check the "mfrpl->mapped_page_list" allocation.
We checked "mfrpl->ibfrpl.page_list" earlier.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
ib_ud_header_init() first clears header and then fills up the various
fields. Later on, it tests header->immediate_present, which it has
already cleared, so the condition is always false. Fix this by adding
an immediate_present parameter and setting header->immediate_present
as is done with grh_present. Also remove unused calculation of
header_len.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
struct ib_qp already holds a pointer to the ib device. No need to dive to the
hw device object to retrieve it.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In mlx4_ib_post_recv(), we should check the queue for overflow using
recv_cq instead of send_cq (current code looks like a copy-and-paste
mistake).
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
As for memfree mthca hardware, ConnectX also requires SRQ WQE scatter
entries to be initialized with the invalid L_Key at SRQ creation time.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Current code has a limitation: an LSO header is not allowed to cross a
64 byte boundary. This patch removes this limitation by setting the
WQE RR for large headers thus allowing LSO headers of any size. The
extra buffer reserved for MLX4_IB_QP_LSO QPs has been doubled, from 64
to 128 bytes, assuming this is reasonable upper limit for header
length. Also, this patch will cause IB_DEVICE_UD_TSO to be set only
for HCA FW versions that set MLX4_DEV_CAP_FLAG_BLH; e.g. FW version
2.6.000 and higher.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
There is no such flag DE - the field is reserved and should be zero.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Userspace apps are supposed to release all ib device resources if they
receive a fatal async event (IBV_EVENT_DEVICE_FATAL). However, the
app has no way of knowing when the device has come back up, except to
repeatedly attempt ibv_open_device() until it succeeds.
However, currently there is no protection against the open succeeding
while the device is in being removed following the fatal event. In
this case, the open will succeed, but as a result the device waits in
the middle of its removal until the new app releases its resources --
and the new app will not do so, since the open succeeded at a point
following the fatal event generation.
This patch adds an "active" flag to the device. The active flag is set
to false (in the fatal event flow) before the "fatal" event is
generated, so any subsequent ibv_dev_open() call to the device will
fail until the device comes back up, thus preventing the above
deadlock.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
mlx4_ib_lock_cqs()/mlx4_ib_unlock_cqs() are helper functions that
lock/unlock both CQs attached to a QP in the proper order to avoid
AB-BA deadlocks. Annotate this so sparse can understand what's going
on (and warn us if we misuse these functions).
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Replace open-coded reimplementations with printk_once().
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The ConnectX Programmer's Reference Manual states that the "SO" bit
must be set when posting Fast Register and Local Invalidate send work
requests. When this bit is set, the work request will be executed
only after all previous work requests on the send queue have been
executed. (If the bit is not set, Fast Register and Local Invalidate
WQEs may begin execution too early, which violates the defined
semantics for these operations)
This fixes the issue with NFS/RDMA reported in
<http://lists.openfabrics.org/pipermail/general/2009-April/059253.html>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The low-level mlx4 driver modified the page-list addresses for fast
register work requests post send to big-endian, and set a "present"
bit. This caused problems later when the consumer attempted to unmap
the pages using the page-list (using the list addresses which were
assumed to be still in CPU-endian order). Fix the mlx4 driver to
allocate two buffers and use a private buffer for the hardware-format
bus addresses.
This patch fixes <https://bugs.openfabrics.org/show_bug.cgi?id=1571>,
an NFS/RDMA server crash. The cause of the crash was found by Vu Pham
of Mellanox. The fix is along the lines suggested by Steve Wise in
comment #21 in bug 1571.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The PAT work on x86 has finally made pgprot_writecombine() a usable API
for modular drivers. As the comment indicates, this is exactly what we
want to use in mlx4_ib to map BlueFlame pages up to userspace, since
using WC for these pages improves small message latency significantly.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
According to the ConnectX programmer's reference manual, all
operations should be stopped, all QPs should be torn down and all WQEs
flushed before the CLOSE_PORT command is invoked. In some cases
reversing the order of operations (as implemented now) could cause
a loss of completions.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When snooping a PortInfo MAD, its client_reregister bit is checked.
If the bit is ON then a CLIENT_REREGISTER event is dispatched,
otherwise a LID_CHANGE event is dispatched. This way of decision
ignores the cases where the MAD changes the LID along with an
instruction to reregister (so a necessary LID_CHANGE event won't be
dispatched) or the MAD is neither of these (and an unnecessary
LID_CHANGE event will be dispatched).
This causes problems at least with IPoIB, which will do a "light"
flush on reregister, rather than the "heavy" flush required due to a
LID change.
Fix this by dispatching a CLIENT_REREGISTER event if the
client_reregister bit is set, but also compare the LID in the MAD to
the current LID. If and only if they are not identical then a
LID_CHANGE event is dispatched.
Signed-off-by: Moni Shoua <monis@voltaire.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Yossi Etigin <yosefe@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The base versions handle constant folding just fine, use them
directly. The replacements are OK in the include/ files as they are
not exported to userspace so we don't need the __ prefixed versions.
This patch does not affect code generation at all.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The current work request posting code writes the LSO segment before
writing any data segments. This leaves a window where the LSO segment
overwrites the stamping in one cacheline that the HCA prefetches
before the rest of the cacheline is filled with the correct data
segments. When the HCA processes this work request, a local
protection error may result.
Fix this by saving the LSO header size field off and writing it only
after all data segments are written. This fix is a cleaned-up version
of a patch from Jack Morgenstein <jackm@dev.mellanox.co.il>.
This fixes <https://bugs.openfabrics.org/show_bug.cgi?id=1383>.
Reported-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IB/iser: Add dependency on INFINIBAND_ADDR_TRANS
IPoIB: Do not join broadcast group if interface is brought down
RDMA/nes: Fix for NIPQUAD removal
IPoIB: Fix loss of connectivity after bonding failover on both sides
IB/mlx4: Don't register IB device for adapters with no IB ports
mlx4_core: Fix warning from min()
IB/ehca: spin_lock_irqsave() takes an unsigned long
If the mlx4_ib driver finds an adapter that has only ethernet ports, the
current code will register an IB device with 0 ports. Nothing useful or
sensible can be done with such a device, so just skip registering it.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commit f780a9f1 ("mlx4_core: Add ethernet fields to CQE struct")
introduced a bug in how wc->sl is set in mlx4_ib_poll_one() -- since
cqe->sl_vid is a big-endian value, the shift must be done after
converting to host endianness.
This bug was found using sparse endianness checking.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When resizing a CQ, when copying over unpolled CQEs from the old CQE
buffer to the new buffer, the ownership bit must be set appropriately
for the new buffer, or the ownership bit in the new buffer gets
corrupted.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When using MSI-X mode, create a completion event queue for each CPU.
Report the number of completion EQs in a new struct mlx4_caps member,
num_comp_vectors, and extend the mlx4_cq_alloc() interface with a
vector parameter so that consumers can specify which completion EQ
should be used to report events for the CQ being created.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When resizing a CQ, MTTs associated with the old CQE buffer were not
freed. As a result, if any app used resize CQ repeatedly, all MTTs
were eventually exhausted, which led to all memory registration
operations failing until the driver is reloaded.
Once the RESIZE_CQ command returns successfully from FW, FW no longer
accesses the old CQ buffer, so it is safe to deallocate the MTT
entries used by the old CQ buffer.
Finally, if the RESIZE_CQ command fails, the MTTs allocated for the
new CQEs buffer also need to be de-allocated.
This fixes <https://bugs.openfabrics.org/show_bug.cgi?id=1416>.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Set mr->umem to NULL in mlx4_ib_alloc_fast_reg_mr(). Otherwise
ib_dereg_mr() may invoke ib_umem_release() on a random pointer value
and get an oops.
Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Multi-protocol adapters support different port types. Each consumer
of mlx4_core queries for supported port types; in particular mlx4_ib
can no longer assume that all physical ports belong to it. Port type
is configured through a sysfs interface. When the type of a port is
changed, all mlx4 interfaces are unregistered, and then registered
again with the new port types.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
To allow allocating an aligned range of consecutive QP numbers, add an
interface to reserve an aligned range of QP numbers and have the QP
allocation function always take a QP number.
This will be used for RSS support in the mlx4_en Ethernet driver and
also potentially by IPoIB RSS support.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Set RLKEY bit in the HW context for kernel QPs so that kernel QPs can
use the reserved L_Key for memory reference.
Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IPoIB: Fix deadlock on RTNL between bcast join comp and ipoib_stop()
RDMA/nes: Fix client side QP destroy
IB/mlx4: Fix up fast register page list format
mlx4_core: Set RAE and init mtt_sz field in FRMR MPT entries
Byte swap the addresses in the page list for fast register work requests
to big endian to match what the HCA expectx. Also, the addresses must
have the "present" bit set so that the HCA knows it can access them.
Otherwise the HCA will fault the first time it accesses the memory
region.
Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Initialize the L_Key and R_Key for memory regions returned from
mlx4_ib_alloc_fast_reg_mr(). Otherwise callers just get garbage for
the memory keys and can't do anything useful with these MRs.
Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Current code limits the max message size to 2K for UD QPs, while MTU
might be as big as 4K. This patch sets the maximum message size to
4K, which is needed for UD to work correctly on fabrics with a 4K MTU.
Signed-off-by: Alex Naslednikov <xalex@mellanox.co.il>
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add ethernet-related fields to struct mlx4_cqe so that the mlx4_en
ethernet NIC driver can share the same definition.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Update existing Mellanox copyright lines to 2008, and add such lines
to files where they are missing.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for the following operations to mlx4 when device firmware
supports them:
- Send with invalidate and local invalidate send queue work requests;
- Allocate/free fast register MRs;
- Allocate/free fast register MR page lists;
- Fast register MR send queue work requests;
- Local DMA L_Key.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Current code uses kmalloc() and then just does a bitwise OR operation on
qp->flags in create_qp_common(), which means that qp->flags may
potentially have some unintended bits set. This patch uses kzalloc()
and avoids further explicit clearing of structure members, which also
shrinks the code:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-65 (-65)
function old new delta
create_qp_common 2024 1959 -65
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for handling the IB_QP_CREATE_MULTICAST_BLOCK_LOOPBACK
flag by using the per-multicast group loopback blocking feature of
mlx4 hardware.
Signed-off-by: Ron Livne <ronli@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commit 65adfa91 ("IB/mlx4: Fix RESET to RESET and RESET to ERROR
transitions") added some extra code to handle a QP state transition
from RESET to ERROR. However, the latest 1.2.1 version of the IB spec
has clarified that this transition is actually not allowed, so we can
remove this extra code again.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ConnectX HCAs support the IB_MGMT_CLASS_CONG_MGMT management class, so
process MADs of this class through the MAD_IFC firmware command.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ConnectX returns the max message size it supports through the
QUERY_DEV_CAP firmware command. When modifying a QP to RTR, the max
message size for the QP must be specified. This value must not exceed
the value declared through QUERY_DEV_CAP. The current code ignores
the max allowed size and unconditionally sets the value to 2^31. This
patch sets all QPs to the max value allowed as returned from firmware.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch adds support for the IB "base memory management extension"
(BMME) and the equivalent iWARP operations (which the iWARP verbs
mandates all devices must implement). The new operations are:
- Allocate an ib_mr for use in fast register work requests.
- Allocate/free a physical buffer lists for use in fast register work
requests. This allows device drivers to allocate this memory as
needed for use in posting send requests (eg via dma_alloc_coherent).
- New send queue work requests:
* send with remote invalidate
* fast register memory region
* local invalidate memory region
* RDMA read with invalidate local memory region (iWARP only)
Consumer interface details:
- A new device capability flag IB_DEVICE_MEM_MGT_EXTENSIONS is added
to indicate device support for these features.
- New send work request opcodes IB_WR_FAST_REG_MR, IB_WR_LOCAL_INV,
IB_WR_RDMA_READ_WITH_INV are added.
- A new consumer API function, ib_alloc_mr() is added to allocate
fast register memory regions.
- New consumer API functions, ib_alloc_fast_reg_page_list() and
ib_free_fast_reg_page_list() are added to allocate and free
device-specific memory for fast registration page lists.
- A new consumer API function, ib_update_fast_reg_key(), is added to
allow the key portion of the R_Key and L_Key of a fast registration
MR to be updated. Consumers call this if desired before posting
a IB_WR_FAST_REG_MR work request.
Consumers can use this as follows:
- MR is allocated with ib_alloc_mr().
- Page list memory is allocated with ib_alloc_fast_reg_page_list().
- MR R_Key/L_Key "key" field is updated with ib_update_fast_reg_key().
- MR made VALID and bound to a specific page list via
ib_post_send(IB_WR_FAST_REG_MR)
- MR made INVALID via ib_post_send(IB_WR_LOCAL_INV),
ib_post_send(IB_WR_RDMA_READ_WITH_INV) or an incoming send with
invalidate operation.
- MR is deallocated with ib_dereg_mr()
- page lists dealloced via ib_free_fast_reg_page_list().
Applications can allocate a fast register MR once, and then can
repeatedly bind the MR to different physical block lists (PBLs) via
posting work requests to a send queue (SQ). For each outstanding
MR-to-PBL binding in the SQ pipe, a fast_reg_page_list needs to be
allocated (the fast_reg_page_list is owned by the low-level driver
from the consumer posting a work request until the request completes).
Thus pipelining can be achieved while still allowing device-specific
page_list processing.
The 32-bit fast register memory key/STag is composed of a 24-bit index
and an 8-bit key. The application can change the key each time it
fast registers thus allowing more control over the peer's use of the
key/STag (ie it can effectively be changed each time the rkey is
rebound to a page list).
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The idea is that for QPs with fixed size work requests (eg selective
signaling QPs), before stamping the WQE, we read the value of the DS
field, which gives the effective size of the descriptor as used in the
previous post. Then we stamp only that area, since the rest of the
descriptor is already stamped.
When initializing the send queue buffer, make sure the DS field is
initialized to the max descriptor size so that the subsequent stamping
will be done on the entire descriptor area.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When creating a kernel QP where the consumer asked for a send queue
with lots of scatter/gater entries, set_kernel_sq_size() incorrectly
returned an error if the send queue stride is larger than the
hardware's maximum send work request descriptor size. This is not a
problem; the only issue is to make sure that the actual descriptors
used do not overflow the maximum descriptor size, so check this instead.
Clamp the returned max_send_sge value to be no bigger than what
query_device returns for the max_sge to avoid confusing hapless users,
even if the hardware is capable of handling a few more s/g entries.
This bug caused NFS/RDMA mounts to fail when the server adapter used
the mlx4 driver.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
drivers/infiniband/hw/mlx4/qp.c: In function 'mlx4_ib_post_send':
drivers/infiniband/hw/mlx4/qp.c:1460: warning: 'seglen' may be used uninitialized in this function
This is the dopey gcc-doesn't-know-that-foo(&var)-writes-to-var problem.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When I merged bbf8eed1 ("IB/mlx4: Add support for resizing CQs") I
changed things around so that mlx4_ib_alloc_cq_buf() and
mlx4_ib_free_cq_buf() were used everywhere they could be. However, I
screwed up the number of entries passed into mlx4_ib_alloc_cq_buf()
in a couple places -- the function bumps the number of entries
internally, so the caller shouldn't add 1 as well.
Passing a too-big value for the number of entries to mlx4_ib_free_cq_buf()
can cause the cleanup to go off the end of an array and corrupt
allocator state in interesting ways.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Extend the mlx4_cq_resize() API with a way to set the "collapsed" flag
for the CQ being created.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a new parameter, dmasync, to the ib_umem_get() prototype. Use dmasync = 1
when mapping user-allocated CQs with ib_umem_get().
Signed-off-by: Arthur Kepner <akepner@sgi.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: David Miller <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In addition to mlx4_ib, there will be ethernet and FC consumers of
mlx4_core, so move the code for managing kernel doorbells into the
core module to avoid having to duplicate this multiple times.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This converts the main ib_device to use struct device instead of struct
class_device as class_device is going away.
Signed-off-by: Tony Jones <tonyj@suse.de>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The mlx4_ib driver is stable enough for production use, so bump the
version number to 1.0 to indicate this.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If the QP was moved to another state (such as SQE) by the hardware,
then after this change the user won't have to set the IBV_QP_CUR_STATE
mask in order to execute modify QP in order to recover from this state.
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a new IB_WR_SEND_WITH_INV send opcode that can be used to mark a
"send with invalidate" work request as defined in the iWARP verbs and
the InfiniBand base memory management extensions. Also put "imm_data"
and a new "invalidate_rkey" member in a new "ex" union in struct
ib_send_wr. The invalidate_rkey member can be used to pass in an
R_Key/STag to be invalidated. Add this new union to struct
ib_uverbs_send_wr. Add code to copy the invalidate_rkey field in
ib_uverbs_post_send().
Fix up low-level drivers to deal with the change to struct ib_send_wr,
and just remove the imm_data initialization from net/sunrpc/xprtrdma/,
since that code never does any send with immediate operations.
Also, move the existing IB_DEVICE_SEND_W_INV flag to a new bit, since
the iWARP drivers currently in the tree set the bit. The amso1100
driver at least will silently fail to honor the IB_SEND_INVALIDATE bit
if passed in as part of userspace send requests (since it does not
implement kernel bypass work request queueing). Remove the flag from
all existing drivers that set it until we know which ones are OK.
The values chosen for the new flag is not consecutive to avoid clashing
with flags defined in the XRC patches, which are not merged yet but
which are already in use and are likely to be merged soon.
This resurrects a patch sent long ago by Mikkel Hagen <mhagen@iol.unh.edu>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Rather than have build_mlx_header() return a negative value on failure
and the length of the segments it builds on success, add a pointer
parameter to return the length and return 0 on success. This matches
the calling convention used for build_lso_seg() and generates slightly
smaller code -- eg, on 64-bit x86:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-22 (-22)
function old new delta
mlx4_ib_post_send 2023 2001 -22
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a create_flags member to struct ib_qp_init_attr that will allow a
kernel verbs consumer to create a pass special flags when creating a QP.
Add a flag value for telling low-level drivers that a QP will be used
for IPoIB UD LSO. The create_flags member will also be useful for XRC
and ehca low-latency QP support.
Since no create_flags handling is implemented yet, add code to all
low-level drivers to return -EINVAL if create_flags is non-zero.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ConnectX devices support checksum generation and verification of TCP
and UDP packets for UD IPoIB messages. This patch checks if the HCA
supports this and sets the IB_DEVICE_UD_IP_CSUM capability flag if it
does. It implements support for handling the IB_SEND_IP_CSUM send
flag and setting the csum_ok field in receive work completions.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Ali Ayub <ali@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The struct mlx4_interface.event() method was supposed to get an enum
mlx4_dev_event, but the driver code was actually passing in the
hardware enum mlx4_event values. Fix up the callers of
mlx4_dispatch_event() so that they pass in the right type of value,
and fix up the event method in mlx4_ib so that it can handle the enum
mlx4_dev_event values.
This eliminates the need for the subtype parameter to the event
method, so remove it.
This also fixes the sparse warning
drivers/net/mlx4/intf.c:127:48: warning: mixing different enum types
drivers/net/mlx4/intf.c:127:48: int enum mlx4_event versus
drivers/net/mlx4/intf.c:127:48: int enum mlx4_dev_event
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Currently mlx4_ib_fmr_alloc() calls mlx4_mr_enable() instead of
mlx4_fmr_enable(). The two functions are equivalent at the moment, but
this is not really correct (and the change is needed to fix a bug).
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ConnectX HCA supports shrinking WQEs, so that a single work request
can be made of multiple units of wqe_shift. This way, WRs can differ
in size, and do not have to be a power of 2 in size, saving memory and
speeding up send WR posting. Unfortunately, if we do this then the
wqe_index field in CQEs can't be used to look up the WR ID anymore, so
our implementation does this only if selective signaling is off.
Further, on 32-bit platforms, we can't use vmap() to make the QP
buffer virtually contigious. Thus we have to use constant-sized WRs to
make sure a WR is always fully within a single page-sized chunk.
Finally, we use WRs with the NOP opcode to avoid wrapping around the
queue buffer in the middle of posting a WR, and we set the
NoErrorCompletion bit to avoid getting completions with error for NOP
WRs. However, NEC is only supported starting with firmware 2.2.232,
so we use constant-sized WRs for older firmware. And, since MLX QPs
only support SEND, we use constant-sized WRs in this case.
When stamping during NOP posting, do stamping following setting of the
NOP WQE valid bit.
Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We use struct mlx4_buf for kernel QP, CQ and SRQ buffers, and the code
to look up an entry is duplicated in get_cqe_from_buf() and the QP and
SRQ versions of get_wqe(). Factor this out into mlx4_buf_offset().
This will also make it easier to switch over to using vmap() for buffers.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The string mlx4_ib_version was defined, but never used. Print out the
version once when the first device is initialized.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The firmware QUERY_ADAPTER command does not return vendor_id,
device_id, and revision_id; eliminate these fields from the query.
Initialize the rev_id field of the mlx4 device via init_node_data (MAD
IFC query), as is done in the query_device verb implementation.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Rather than byte-swapping cqe->g_mlpath_rqpn each time we extract a
field from it, byte-swap it once into a temporary variable. This
results in smaller, better code -- eg, on 32-bit x86:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-5 (-5)
function old new delta
mlx4_ib_poll_cq 1188 1183 -5
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Fix the value of pkey_index in completions to get a valid value for
GSI QPs. Without this fix, incoming GSI packets on port 2 get an
invalid P_Key index in the completion, which prevents the MAD layer
from sending back a response, which can make the second port of
ConnectX HCAs completely useless.
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Because of a typo, mlx4_ib_post_send() takes the same lock rq.lock as
mlx4_ib_post_recv(). Correct the code so the intended sq.lock is
taken when posting a send.
Noticed by Yossi Leybovitch and pointed out by Jack Morgenstein from
Mellanox.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add sanity checks to send queue sizes passed in from userspace. The
minimum sq stride value below is taken from the MT25408 PRM (section
11.10, Table 306, log_sq_stride definition).
Without this check, userspace can submit arbitrarily large/small
values for the number of WQEs and the stride, which can crash the
kernel.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Implement FMRs for mlx4. This is an adaptation of code from mthca.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Write MTT entries directly to ICM from the driver (eliminating use of
WRITE_MTT command). This reduces the number of FW commands needed to
register an MR by at least a factor of 2 and speeds up memory
registration significantly. This code will also be used to implement
FMRs.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
display the following device information under /sys/class/infiniband/mlx4_X:
board_id, fw_ver, hw_rev, hca_type.
This patch makes this information available to userspace utilities
such as ibstat and ibv_devinfo.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
mlx4_srq_query() returns a big-endian 16-bit value through an int *,
which screws up sparse checking. Fix this so that a CPU-endian value
is returned.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Use a __set_data_seg() helper in mlx4_ib_post_recv() too; in addition
to making the code easier to read, this also allows gcc to generate
better code -- on x86_64:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-8 (-8)
function old new delta
mlx4_ib_post_recv 359 351 -8
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This is an addendum to commit 0e6e7416 ("IB/mlx4: Handle new FW
requirement for send request prefetching"). We also need to handle
prefetch marking properly for S/G segments, or else the HCA may end up
processing S/G segments that are not fully written and end up sending
the wrong data. This can actually cause data corruption in practice,
especially on systems with relatively slow CPUs (where the HCA is more
likely to prefetch while the CPU is in the middle of writing a work
request into memory).
We write S/G segments in reverse order into the WQE, in order to
guarantee that the first dword of all cachelines containing S/G
segments is written last (overwriting the headroom invalidation
pattern). The entire cacheline will thus contain valid data when the
invalidation pattern is overwritten.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
A stray semicolon makes us inadvertently ignore the value of err.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Current code has a cut-and-paste error and returns IB_WC_SEND when it
should return IB_WC_RDMA_READ.
Signed-off-by: Vu Pham <vu@mellanox.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The error handling code at err_wrid in create_qp_common() does not
handle a userspace QP attached to an SRQ correctly, since it ends up
in the else clause of the if statement. This means it tries to
kfree() the uninitialized qp->sq.wrid and qp->rq.wrid pointers. Fix
this so we only free the wrid arrays for kernel QPs.
Pointed out by Michael S. Tsirkin <mst@dev.mellanox.co.il>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Temporarily allocated struct mlx4_qp_context *context is leaked by
several error paths. The patch takes advantage of the return value
'err' being preinitialized to -EINVAL.
Spotted by Coverity (CID 1768).
Signed-off-by: Florin Malita <fmalita@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Factor code to set remote address, atomic and datagram segments out of
mlx4_ib_post_send() into small helper functions. This doesn't change
the generated code in any significant way, and makes the source easier
on the eyes.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Factor code to set data segment entries out of mlx4_ib_post_send()
into set_data_seg(). This cleans up the code and lets the compiler do
a better job -- on x86_64:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-16 (-16)
function old new delta
mlx4_ib_post_send 1598 1582 -16
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Return the receive queue sizes for both userspace QPs and kernel Qps
(not just kernel QPs) from mlx4_ib_query_qp(). Also zero the send
queue sizes for userspace QPs to avoid a possible information leak,
and set the max_inline_data for kernel QPs to 0 since inline sends are
not supported for kernel QPs.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When clearing the ib_ah_attr parameter in to_ib_ah_attr(), use sizeof
*ib_ah_attr instead of sizeof *path. This is the same bug as was
fixed for mthca in 99d4f22e ("IB/mthca: Use correct structure size in
call to memset()"), but the code was cut and pasted into mlx4 before the
fix was merged.
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When a QP is in the INIT state, the sched_queue field hasn't been given
to the firmware yet, so the firmware cannot return the value when the QP
is queried. To handle this, use the port number that is saved in the
driver's QP data structure.
Found by Dotan Barak and Yaron Gepstein of Mellanox.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Correct the mask used to get the flow label, since the field is 20 bits,
not 24 bits.
Found by Dotan Barak and Yaron Gepstein of Mellanox.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Change Kconfig objects from "menu, config" into "menuconfig" so
that the user can disable the whole feature without having to
enter the menu first.
Signed-off-by: Jan Engelhardt <jengelh@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Get the maximum message size from the device capabilities returned
from the QUERY_DEV_CAP firmware command, rather than hard-coding 2 GB.
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
mlx4_ib.h uses struct mutex, so although <linux/mutex.h> seems to be
pulled in indirectly by one of the headers it includes, the right
thing is to include <linux/mutex.h> directly.
Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We need to keep a spare entry in the SRQ so that there always is a
next WQE available when posting receives (so that we can tell the
difference between a full queue and an empty queue). So subtract 1
from the value HW gives us before reporting the limit on SRQ entries
to consumers.
Found by Mellanox QA.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Inline data segments in send WQEs are not allowed to cross a 64 byte
boundary. We use inline data segments to hold the UD headers for MLX
QPs (QP0 and QP1). A send with GRH on QP1 will have a UD header that
is too big to fit in a single inline data segment without crossing a
64 byte boundary, so split the header into two inline data segments.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Upcoming firmware introduces command interface revision 3, which
changes the way port capabilities are queried and set. Update the
driver to handle both the new and old command interfaces by adding a
new MLX4_FLAG_OLD_PORT_CMDS that it is set after querying the firmware
interface revision and then using the correct interface based on the
setting of the flag.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When compacting CQ entries, we need to set the correct value of the
ownership bit in case the value is different between the index we copy
the CQE from and the index we copy it to.
Found by Ronni Zimmerman of Mellanox.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The calculation of max_inline_data in set_kernel_sq_size() is bogus,
since it doesn't take into account the fact that inline segments may
not cross a 64-byte boundary, and hence multiple inline segments will
probably need to be used to post large inline sends.
We don't support inline sends for kernel QPs anyway, so there's no
point in doing this calculation anyway, since the field is just zeroed
out a little later. So just delete the bogus calculation.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
New ConnectX firmware introduces FW command interface revision 2,
which requires that for each QP, a chunk of send queue entries (the
"headroom") is kept marked as invalid, so that the HCA doesn't get
confused if it prefetches entries that haven't been posted yet. Add
code to the driver to do this, and also update the user ABI so that
userspace can request that the prefetcher be turned off for userspace
QPs (we just leave the prefetcher on for all kernel QPs).
Unfortunately, marking send queue entries this way is confuses older
firmware, so we change the driver to allow only FW command interface
revisions 2. This means that users will have to update their firmware
to work with the new driver, but the firmware is changing quickly and
the old firmware has lots of other bugs anyway, so this shouldn't be too
big a deal.
Based on a patch from Jack Morgenstein <jackm@dev.mellanox.co.il>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Doing max(1, foo) where foo is u32 generates a warning, because 1 is a
signed constant. Fix this by using 1U instead.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Cast the increment added to wq->tail when send completions are
processed to u16 to avoid using wrong values caused by standard
integer promotions.
The same bug was fixed in libmlx4 by Eli Cohen <eli@mellanox.co.il>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
QPs attached to an SRQ must never have their own RQ, and QPs not
attached to SRQs must have an RQ with at least 1 entry. Enforce all
of this in set_rq_size().
Based on a patch by Eli Cohen <eli@mellanox.co.il>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The code in __mlx4_ib_modify_qp() overwrites context->params1 after
the RNR retry parameter is ORed in, which results in the RNR retry
parameter always being set to 0. Fix this by moving where we OR in
the value to later in the function, after the initial assignment of
context->params1.
Found by the Mellanox firmware group.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We need to initialize the owner bit of send queue WQEs to hardware
ownership whenever the QP is modified from reset to init, not just
when the QP is first allocated. This avoids having the hardware
process stale WQEs when the QP is moved to reset but not destroyed and
then modified to init again.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If a QP is attached to a shared receive queue (SRQ), then it doesn't
have a receive queue (RQ). So don't allocate an RQ doorbell (or map a
doorbell from userspace for userspace QPs) for that QP.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Pass the number of WQEs for the send queue and their size from userspace
to the kernel to avoid having to keep the QP size calculations in sync
between the kernel driver and libmlx4. This fixes a bug seen with the
current mlx4_ib driver and current libmlx4 caused by a difference in the
calculated sizes for SQ WQEs. Also, this gives more flexibility for
userspace to experiment with using multiple WQE BBs for a single SQ WQE.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
wr->opcode is invalid if it's >= ARRAY_SIZE(mlx4_ib_opcode), not just
strictly >.
This was spotted by the Coverity checker (CID 1643).
Signed-off-by: Roland Dreier <rolandd@cisco.com>
According to the IB spec, a QP can be moved from RESET back to RESET
or to the ERROR state, but mlx4 firmware does not support this and
returns an error if we try. Fix the RESET to RESET transition by
just returning 0 without doing anything, and fix RESET to ERROR by
moving the QP from RESET to INIT with dummy parameters and then
transitioning from INIT to ERROR.
Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
max_qp_dest_rdma is already in natural units - no need to shift. This
was discovered by a test that deliberately requests more outstanding
atomic operation than the device supports.
Found by Sagi Rotem at Mellanox.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
uar_lock spinlock was used in mlx4_ib_cq_arm without being initialized
(this only affects 32-bit archs, because uar_lock is not used on
64-bit archs and MLX4_INIT_DOORBELL_LOCK() is a NOP).
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add an InfiniBand driver for Mellanox ConnectX adapters. Because
these adapters can also be used as ethernet NICs and Fibre Channel
HBAs, the driver is split into two modules:
mlx4_core: Handles low-level things like device initialization and
processing firmware commands. Also controls resource allocation
so that the InfiniBand, ethernet and FC functions can share a
device without stepping on each other.
mlx4_ib: Handles InfiniBand-specific things; plugs into the
InfiniBand midlayer.
Signed-off-by: Roland Dreier <rolandd@cisco.com>