Pull cgroup updates from Tejun Heo:
"Several noteworthy changes.
- Parav's rdma controller is finally merged. It is very straight
forward and can limit the abosolute numbers of common rdma
constructs used by different cgroups.
- kernel/cgroup.c got too chubby and disorganized. Created
kernel/cgroup/ subdirectory and moved all cgroup related files
under kernel/ there and reorganized the core code. This hurts for
backporting patches but was long overdue.
- cgroup v2 process listing reimplemented so that it no longer
depends on allocating a buffer large enough to cache the entire
result to sort and uniq the output. v2 has always mangled the sort
order to ensure that users don't depend on the sorted output, so
this shouldn't surprise anybody. This makes the pid listing
functions use the same iterators that are used internally, which
have to have the same iterating capabilities anyway.
- perf cgroup filtering now works automatically on cgroup v2. This
patch was posted a long time ago but somehow fell through the
cracks.
- misc fixes asnd documentation updates"
* 'for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (27 commits)
kernfs: fix locking around kernfs_ops->release() callback
cgroup: drop the matching uid requirement on migration for cgroup v2
cgroup, perf_event: make perf_event controller work on cgroup2 hierarchy
cgroup: misc cleanups
cgroup: call subsys->*attach() only for subsystems which are actually affected by migration
cgroup: track migration context in cgroup_mgctx
cgroup: cosmetic update to cgroup_taskset_add()
rdmacg: Fixed uninitialized current resource usage
cgroup: Add missing cgroup-v2 PID controller documentation.
rdmacg: Added documentation for rdmacg
IB/core: added support to use rdma cgroup controller
rdmacg: Added rdma cgroup controller
cgroup: fix a comment typo
cgroup: fix RCU related sparse warnings
cgroup: move namespace code to kernel/cgroup/namespace.c
cgroup: rename functions for consistency
cgroup: move v1 mount functions to kernel/cgroup/cgroup-v1.c
cgroup: separate out cgroup1_kf_syscall_ops
cgroup: refactor mount path and clearly distinguish v1 and v2 paths
cgroup: move cgroup v1 specific code to kernel/cgroup/cgroup-v1.c
...
Add new entry to the RDMA-CM configfs that allows users
to select default TOS for RDMA-CM QPs.
This is useful for users that want to control the TOS for legacy
applications without changing their code.
Application that sets the TOS explicitly using the rdma_set_option
API will continue to work as expected, meaning overriding the configfs
value.
CC: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Majd Dibbiny <majd@mellanox.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Added support APIs for IB core to register/unregister every IB/RDMA
device with rdma cgroup for tracking rdma resources.
IB core registers with rdma cgroup controller.
Added support APIs for uverbs layer to make use of rdma controller.
Added uverbs layer to perform resource charge/uncharge functionality.
Added support during query_device uverb operation to ensure it
returns resource limits by honoring rdma cgroup configured limits.
Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
- Shared mlx5 updates with net stack (will drop out on merge if Dave's
tree has already been merged)
- Driver updates: cxgb4, hfi1, hns-roce, i40iw, mlx4, mlx5, qedr, rxe
- Debug cleanups
- New connection rejection helpers
- SRP updates
- Various misc fixes
- New paravirt driver from vmware
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYUbAPAAoJELgmozMOVy/dMXcP/iuG5MNzfN8Ny1JftyBQGWg3
cqoQ2OLj9CsXjwVB+5EqbcZHRZY852lKONaLoDKkIOx4YAXO2YuIKOp944vN7EQx
96wfqzT1F5jzAcy5mYZXgLaStGFDAwejKMqeHd0LfJj3OEtemGnVPWYzyqSQmSKo
dzJraS1Z9GIRppzU5WaRpB9PtRBkqIqGJ5vZ0EKLGhed5hYY5r0iMJB0GfriMRDO
lJ4UUVfpsAoLPnqDBFH6IMn2V2UeAw9IR5zNa1mrM1RBfvt/uYTxrw1w3p9WoaNs
GRodhk4DCeAfeyqzVPNBLyXZ4Zq4FzGe3UWM4qysJ1RR4oFNw9Cuw0Fqk8mrfznr
7hv5TpGIckRZiKf8l6e+qLirF0qGtXJg29j2vPVQI9i5nSj95g1agA81PnLQlLLb
flWyxeMj81my7lfMHN1xcV6pqPEKMCOysZmfcvVfJd2XxpjuVD7ekl/YXWp8o8kU
YPdQMqPD626XsD8VpPdMszb9FPmx0JD0HEv+Y1rIFX8JegEI+c3H2X0dqC27T/Ou
FEPWOy025EgHm0Fh/7eIzkG6tjZ4JHoCugJAcxNZGj2XW4eB6r5vY8UwJ8iQRv+n
PVYHiy0UoIRePh0mrdOSSphGZMi/GO/DsqKwCtAMEK43WqZQju6wR7QSIGkh66mp
4uSHJqpf3YEYylxGMhk3
=QeGy
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull rdma updates from Doug Ledford:
"This is the complete update for the rdma stack for this release cycle.
Most of it is typical driver and core updates, but there is the
entirely new VMWare pvrdma driver. You may have noticed that there
were changes in DaveM's pull request to the bnxt Ethernet driver to
support a RoCE RDMA driver. The bnxt_re driver was tentatively set to
be pulled in this release cycle, but it simply wasn't ready in time
and was dropped (a few review comments still to address, and some
multi-arch build issues like prefetch() not working across all
arches).
Summary:
- shared mlx5 updates with net stack (will drop out on merge if
Dave's tree has already been merged)
- driver updates: cxgb4, hfi1, hns-roce, i40iw, mlx4, mlx5, qedr, rxe
- debug cleanups
- new connection rejection helpers
- SRP updates
- various misc fixes
- new paravirt driver from vmware"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (210 commits)
IB: Add vmw_pvrdma driver
IB/mlx4: fix improper return value
IB/ocrdma: fix bad initialization
infiniband: nes: return value of skb_linearize should be handled
MAINTAINERS: Update Intel RDMA RNIC driver maintainers
MAINTAINERS: Remove Mitesh Ahuja from emulex maintainers
IB/core: fix unmap_sg argument
qede: fix general protection fault may occur on probe
IB/mthca: Replace pci_pool_alloc by pci_pool_zalloc
mlx5, calc_sq_size(): Make a debug message more informative
mlx5: Remove a set-but-not-used variable
mlx5: Use { } instead of { 0 } to init struct
IB/srp: Make writing the add_target sysfs attr interruptible
IB/srp: Make mapping failures easier to debug
IB/srp: Make login failures easier to debug
IB/srp: Introduce a local variable in srp_add_one()
IB/srp: Fix CONFIG_DYNAMIC_DEBUG=n build
IB/multicast: Check ib_find_pkey() return value
IPoIB: Avoid reading an uninitialized member variable
IB/mad: Fix an array index check
...
The function ib_resolve_eth_dmac() requires struct qp_attr * and
qp_attr_mask as parameters while the function might be useful to resolve
dmac for address handles. This patch changes the signature of the
function so it can be used in the flow of creating an address handle.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Convert rdma_is_upper_dev_rcu, handle_netdev_upper and
ipoib_get_net_dev_match_addr to the new upper device walk API.
This is just a code conversion; no functional change is intended.
v2
- removed typecast of data
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is an assumption that rdmacm is used only between nodes
in the same IB subnet, this why ARP resolution can be used to turn
IP to GID in rdmacm.
When dealing with IB communication between subnets this assumption
is no longer valid. ARP resolution will get us the next hop device
address and not the peer node's device address.
To solve this issue, we will check user space if it can provide the
GID of the peer node, and fail if not.
We add a sequence number to identify each request and fill in the GID
upon answer from userspace.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Move SA ibnl client registration to ib_core module init.
This will allow us to register a single client to handle
all RDMA_NL_LS operations and make it SA independent.
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Consolidate ib_sa into ib_core, this commit eliminates
ib_sa.ko and makes it part of ib_core.ko
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Consolidate ib_mad into ib_core, this commit eliminates
ib_mad.ko and makes it part of ib_core.ko
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
IB address resolution is declared as a module (ib_addr.ko) which loads
itself before IB core module (ib_core.ko).
It causes to the scenario where IB netlink which is initialized by IB
core can't be used by ib_addr.ko.
In order to solve it, we are converting ib_addr.ko to be part of
IB core module.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Users would like to control the behaviour of rdma_cm.
For example, old applications which don't set the
required RoCE gid type could be executed on RoCE V2
network types. In order to support this configuration,
we implement a configfs for rdma_cm.
In order to use the configfs, one needs to mount it and
mkdir <IB device name> inside rdma_cm directory.
The patch adds support for a single configuration file,
default_roce_mode. The mode can either be "IB/RoCE v1" or
"RoCE v2".
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Currently, cma users can't increase or decrease the cma reference
count. This is necassary when setting cma attributes (like the
default GID type) in order to avoid use-after-free errors.
Adding cma_ref_dev and cma_deref_dev APIs.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In order to validate the route, we need an easy way to check if a
net-device belongs to our RDMA device. Move this helper function
to a header file in order to make this check easier.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Reviewed-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
In order to support multiple GID types, we need to store the gid_type
with each GID. This is also aligned with the RoCE v2 annex "RoCEv2 PORT
GID table entries shall have a "GID type" attribute that denotes the L3
Address type". The currently supported GID is IB_GID_TYPE_IB which is
also RoCE v1 GID type.
This implies that gid_type should be added to roce_gid_table meta-data.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Previously, vlan id and source MAC were used from QP attributes. Since
the net device is now stored in the GID attributes, they could be used
instead of getting this information from the QP attributes.
IB_QP_SMAC, IB_QP_ALT_SMAC, IB_QP_VID and IB_QP_ALT_VID were removed
because there is no known libibverbs that uses them.
This commit also modifies the vendors (mlx4, ocrdma) drivers in order
to use the new approach.
ocrdma driver changes were done by Somnath Kotur <Somnath.Kotur@Avagotech.Com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Sometime consumers might want to search for a GID in a specific port.
For example, when a WC arrives and we want to search the GID
that matches that port - it's better to search only the relevant
port.
Exposing and renaming ib_cache_gid_find_by_port in order to match
the naming convention of the module.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
RoCE GIDs are based on IP addresses configured on Ethernet net-devices
which relate to the RDMA (RoCE) device port.
Currently, each of the low-level drivers that support RoCE (ocrdma,
mlx4) manages its own RoCE port GID table. As there's nothing which is
essentially vendor specific, we generalize that, and enhance the RDMA
core GID cache to do this job.
In order to populate the GID table, we listen for events:
(a) netdev up/down/change_addr events - if a netdev is built onto
our RoCE device, we need to add/delete its IPs. This involves
adding all GIDs related to this ndev, add default GIDs, etc.
(b) inet events - add new GIDs (according to the IP addresses)
to the table.
For programming the port RoCE GID table, providers must implement
the add_gid and del_gid callbacks.
RoCE GID management requires us to state the associated net_device
alongside the GID. This information is necessary in order to manage
the GID table. For example, when a net_device is removed, its
associated GIDs need to be removed as well.
RoCE mandates generating a default GID for each port, based on the
related net-device's IPv6 link local. In contrast to the GID based on
the regular IPv6 link-local (as we generate GID per IP address),
the default GID is also available when the net device is down (in
order to support loopback).
Locking is done as follows:
The patch modify the GID table code both for new RoCE drivers
implementing the add_gid/del_gid callbacks and for current RoCE and
IB drivers that do not. The flows for updating the table are
different, so the locking requirements are too.
While updating RoCE GID table, protection against multiple writers is
achieved via mutex_lock(&table->lock). Since writing to a table
requires us to find an entry (possible a free entry) in the table and
then modify it, this mutex protects both the find_gid and write_gid
ensuring the atomicity of the action.
Each entry in the GID cache is protected by rwlock. In RoCE, writing
(usually results from netdev notifier) involves invoking the vendor's
add_gid and del_gid callbacks, which could sleep.
Therefore, an invalid flag is added for each entry. Updates for RoCE are
done via a workqueue, thus sleeping is permitted.
In IB, updates are done in write_lock_irq(&device->cache.lock), thus
write_gid isn't allowed to sleep and add_gid/del_gid are not called.
When passing net-device into/out-of the GID cache, the device
is always passed held (dev_hold).
The code uses a single work item for updating all RDMA devices,
following a netdev or inet notifier.
The patch moves the cache from being a client (which was incorrect,
as the cache is part of the IB infrastructure) to being explicitly
initialized/freed when a device is registered/removed.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This gets rid of the weird in-between state where struct ib_device
was allocated but the kobject didn't work.
Consequently ib_device_release is now guaranteed to be called in
all situations and we needn't duplicate its kfrees on error paths.
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Existing user space applications provide only IBoE L3 address
attributes to the kernel when they issue a modify QP modify. To work
with them and let such apps (plus kernel consumers which don't use the
RDMA-CM) keep working transparently under the IBoE GID IP addressing
changes, add an Eth L2 address resolution helper.
Signed-off-by: Moni Shoua <monis@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Add a new parameter to ib_register_device() so that low-level device
drivers can pass in a pointer to a callback function that will be
called for each port that is registered in sysfs. This allows
low-level device drivers to create files in
/sys/class/infiniband/<hca>/ports/<N>/
without having to poke through the internals of the RDMA sysfs handling.
There is no need for an unregister function since the kobject
reference will go to zero when ib_unregister_device() is called.
Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Move the InfiniBand headers from drivers/infiniband/include to include/rdma.
This allows InfiniBand-using code to live elsewhere, and lets us remove the
ugly EXTRA_CFLAGS include path from the InfiniBand Makefiles.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!