Commit Graph

29 Commits

Author SHA1 Message Date
Eric Dumazet 93d05d4a32 net: provide generic busy polling to all NAPI drivers
NAPI drivers no longer need to observe a particular protocol
to benefit from busy polling (CONFIG_NET_RX_BUSY_POLL=y)

napi_hash_add() and napi_hash_del() are automatically called
from core networking stack, respectively from
netif_napi_add() and netif_napi_del()

This patch depends on free_netdev() and netif_napi_del() being
called from process context, which seems to be the norm.

Drivers might still prefer to call napi_hash_del() on their
own, since they might combine all the rcu grace periods into
a single one, knowing their NAPI structures lifetime, while
core networking stack has no idea of a possible combining.

Once this patch proves to not bring serious regressions,
we will cleanup drivers to either remove napi_hash_del()
or provide appropriate rcu grace periods combining.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-18 16:17:42 -05:00
Eric Dumazet d64b5e85bf net: add netif_tx_napi_add()
netif_tx_napi_add() is a variant of netif_napi_add()

It should be used by drivers that use a napi structure
to exclusively poll TX.

We do not want to add this kind of napi in napi_hash[] in following
patches, adding generic busy polling to all NAPI drivers.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-18 16:17:41 -05:00
Carol L Soto b0f6446377 net/mlx4_core: Fix unintialized variable used in error path
The uninitialized value name in mlx4_en_activate_cq was used in order
to print an error message. Fixing it by replacing it with cq->vector.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Carol L Soto <clsoto@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 16:40:27 -07:00
Ido Shamay de1618034a net/mlx4_core: Move affinity hints to mlx4_core ownership
Now that EQs management is in the sole responsibility of mlx4_core,
the IRQ affinity hints configuration should be in its hands as well.
request_irq is called only once by the first consumer (maybe mlx4_ib),
so mlx4_en passes the affinity mask too late. We also need to request
vectors according to the cores we want to run on.

mlx4_core distribution of IRQs to cores is straight forward,
EQ(i)->IRQ will set affinity hint to core i.
Consumers need to request EQ vectors, according to their cores
considerations (NUMA).

Signed-off-by: Ido Shamay <idos@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-30 23:35:34 -07:00
Matan Barak c66fa19c40 net/mlx4: Add EQ pool
Previously, mlx4_en allocated EQs and used them exclusively.
This affected RoCE performance, as applications which are
events sensitive were limited to use only the legacy EQs.

Change that by introducing an EQ pool. This pool is managed
by mlx4_core. EQs are assigned to ports (when there are limited
number of EQs, multiple ports could be assigned to the same EQs).

An exception to this rule is the ASYNC EQ which handles various events.

Legacy EQs are completely removed as all EQs could be shared.

When a consumer (mlx4_ib/mlx4_en) requests an EQ, it asks for
EQ serving on a specific port. The core driver calculates which
EQ should be assigned to that request.

Because IRQs are shared between IB and Ethernet modules, their
names only include the PCI device BDF address.

Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Ido Shamay <idos@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-30 23:35:34 -07:00
Yishai Hadas 872bf2fb69 net/mlx4_core: Maintain a persistent memory for mlx4 device
Maintain a persistent memory that should survive reset flow/PCI error.
This comes as a preparation for coming series to support above flows.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-25 14:43:13 -08:00
Amir Vadai 858e6c3210 net/mlx4_en: cq->irq_desc wasn't set in legacy EQ's
Fix a regression introduced by commit 35f6f45 ("net/mlx4_en: Don't use
irq_affinity_notifier to track changes in IRQ affinity map").
When core is started in legacy EQ's (number of IRQ's < rx rings), cq->irq_desc
was NULL.  This caused a kernel crash under heavy traffic - when having more
than rx NAPI budget completions.
Fixed to have it set for both EQ modes.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-16 23:28:32 -07:00
Amir Vadai bb273617a6 net/mlx4_en: IRQ affinity hint is not cleared on port down
Need to remove affinity hint at mlx4_en_deactivate_cq() and not at
mlx4_en_destroy_cq() - since affinity_mask might be free'd while still
being used by procfs.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-02 18:29:23 -07:00
Amir Vadai 35f6f45368 net/mlx4_en: Don't use irq_affinity_notifier to track changes in IRQ affinity map
IRQ affinity notifier can only have a single notifier - cpu_rmap
notifier. Can't use it to track changes in IRQ affinity map.
Detect IRQ affinity changes by comparing CPU to current IRQ affinity map
during NAPI poll thread.

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ben Hutchings <ben@decadent.org.uk>
Fixes: 2eacc23 ("net/mlx4_core: Enforce irq affinity changes immediatly")
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-02 18:29:23 -07:00
Yuval Atias 9e311e77a8 net/mlx4_en: Use affinity hint
The “affinity hint” mechanism is used by the user space
daemon, irqbalancer, to indicate a preferred CPU mask for irqs.
Irqbalancer can use this hint to balance the irqs between the
cpus indicated by the mask.

We wish the HCA to preferentially map the IRQs it uses to numa cores
close to it.  To accomplish this, we use cpumask_set_cpu_local_first(), that
sets the affinity hint according the following policy:
First it maps IRQs to “close” numa cores.  If these are exhausted, the
remaining IRQs are mapped to “far” numa cores.

Signed-off-by: Yuval Atias <yuvala@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 14:58:16 -07:00
David S. Miller 96b2e73c54 Revert "net/mlx4_en: Use affinity hint"
This reverts commit 70a640d0da.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-02 00:18:48 -07:00
Yuval Atias 70a640d0da net/mlx4_en: Use affinity hint
The “affinity hint” mechanism is used by the user space
daemon, irqbalancer, to indicate a preferred CPU mask for irqs.
Irqbalancer can use this hint to balance the irqs between the
cpus indicated by the mask.

We wish the HCA to preferentially map the IRQs it uses to numa cores
close to it.  To accomplish this, we use cpumask_set_cpu_local_first(), that
sets the affinity hint according the following policy:
First it maps IRQs to “close” numa cores.  If these are exhausted, the
remaining IRQs are mapped to “far” numa cores.

Signed-off-by: Yuval Atias <yuvala@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-01 19:16:29 -07:00
Joe Perches 1a91de2883 mellanox: Logging message cleanups
Use a more current logging style.

o Coalesce formats
o Add missing spaces for coalesced formats
o Align arguments for modified formats
o Add missing newlines for some logging messages
o Use DRV_NAME as part of format instead of %s, DRV_NAME to
  reduce overall text.
o Use ..., ##__VA_ARGS__ instead of args... in macros
o Correct a few format typos
o Use a single line message where appropriate

Signed-off-by: Joe Perches <joe@perches.com>
Acked-By: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-08 23:42:02 -04:00
Chris Mason c98235cb85 mlx4_en: don't use napi_synchronize inside mlx4_en_netpoll
The mlx4 driver is triggering schedules while atomic inside
mlx4_en_netpoll:

	spin_lock_irqsave(&cq->lock, flags);
	napi_synchronize(&cq->napi);
		^^^^^ msleep here
	mlx4_en_process_rx_cq(dev, cq, 0);
	spin_unlock_irqrestore(&cq->lock, flags);

This was part of a patch by Alexander Guller from Mellanox in 2011,
but it still isn't upstream.

Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org
Acked-By: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-16 15:02:42 -04:00
Eugenia Emantayev 0276a33061 net/mlx4_en: Add NAPI support for transmit side
Add NAPI for TX side,
implement its support and provide NAPI callback.

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.com>
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-19 19:04:44 -05:00
Eugenia Emantayev 163561a4e2 net/mlx4_en: Datapath structures are allocated per NUMA node
For each RX/TX ring and its CQ, allocation is done on a NUMA node that
corresponds to the core that the data structure should operate on.
The assumption is that the core number is reflected by the ring index.
The affected allocations are the ring/CQ data structures,
the TX/RX info and the shared HW/SW buffer.
For TX rings, each core has rings of all UPs.

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.com>
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-07 19:22:48 -05:00
Eugenia Emantayev 41d942d56c net/mlx4_en: Datapath resources allocated dynamically
Currently all TX/RX rings and completion queues are part of the
netdev priv structure and are allocated statically. This patch
will change the priv to hold only arrays of pointers and therefore
all TX/RX rings and completetion queues will be allocated
dynamically. This is in preparation for NUMA aware allocations.

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.com>
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-07 19:22:48 -05:00
Amir Vadai 9e77a2b837 net/mlx4_en: Add Low Latency Socket (LLS) support
Add basic support for LLS.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-19 18:32:16 -07:00
Amir Vadai ec693d4701 net/mlx4_en: Add HW timestamping (TS) support
The patch allows to enable/disable HW timestamping for incoming and/or
outgoing packets. It adds and initializes all structs and callbacks
needed by kernel TS API.
To enable/disable HW timestamping appropriate ioctl should be used.
Currently HWTSTAMP_FILTER_ALL/NONE and HWTSAMP_TX_ON/OFF only are
supported.
When enabling TS on receive flow - VLAN stripping will be disabled.
Also were made all relevant changes in RX/TX flows to consider TS request
and plant HW timestamps into relevant structures.
mlx4_ib was fixed to compile with new mlx4_cq_alloc() signature.

Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-24 16:30:14 -04:00
Or Gerlitz 08ff32352d mlx4: 64-byte CQE/EQE support
ConnectX-3 devices can use either 64- or 32-byte completion queue
entries (CQEs) and event queue entries (EQEs).  Using 64-byte
EQEs/CQEs performs better because each entry is aligned to a complete
cacheline.  This patch queries the HCA's capabilities, and if it
supports 64-byte CQEs and EQES the driver will configure the HW to
work in 64-byte mode.

The 32-byte vs 64-byte mode is global per HCA and not per CQ or EQ.

Since this mode is global, userspace (libmlx4) must be updated to work
with the configured CQE size, and guests using SR-IOV virtual
functions need to know both EQE and CQE size.

In case one of the 64-byte CQE/EQE capabilities is activated, the
patch makes sure that older guest drivers that use the QUERY_DEV_FUNC
command (e.g as done in mlx4_core of Linux 3.3..3.6) will notice that
they need an update to be able to work with the PPF. This is done by
changing the returned pf_context_behaviour not to be zero any more. In
case none of these capabilities is activated that value remains zero
and older guest drivers can run OK.

The SRIOV related flow is as follows

1. the PPF does the detection of the new capabilities using
   QUERY_DEV_CAP command.

2. the PPF activates the new capabilities using INIT_HCA.

3. the VF detects if the PPF activated the capabilities using
   QUERY_HCA, and if this is the case activates them for itself too.

Note that the VF detects that it must be aware to the new PF behaviour
using QUERY_FUNC_CAP.  Steps 1 and 2 apply also for native mode.

User space notification is done through a new field introduced in
struct mlx4_ib_ucontext which holds device capabilities for which user
space must take action. This changes the binary interface so the ABI
towards libmlx4 exposed through uverbs is bumped from 3 to 4 but only
when **needed** i.e. only when the driver does use 64-byte CQEs or
future device capabilities which must be in sync by user space. This
practice allows to work with unmodified libmlx4 on older devices (e.g
A0, B0) which don't support 64-byte CQEs.

In order to keep existing systems functional when they update to a
newer kernel that contains these changes in VF and userspace ABI, a
module parameter enable_64b_cqe_eqe must be set to enable 64-byte
mode; the default is currently false.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
2012-11-26 10:19:17 -08:00
Amir Vadai 1eb8c695bd net/mlx4_en: Add accelerated RFS support
Use RFS infrastructure and flow steering in HW to keep CPU
affinity of rx interrupts and application per TCP stream.

A flow steering filter is added to the HW whenever the RFS
ndo callback is invoked by core networking code.

Because the invocation takes place in interrupt context, the
actual setup of HW is done using workqueue. Whenever new filter
is added, the driver checks for expiry of existing filters.

Since there's window in time between the point where the core
RFS code invoked the ndo callback, to the point where the HW
is configured from the workqueue context, the 2nd, 3rd etc
packets from that stream will cause the net core to invoke
the callback again and again.

To prevent inefficient/double configuration of the HW, the filters
are kept in a database which is indexed using hash function to enable
fast access.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19 08:34:37 -07:00
Amir Vadai d9236c3f10 {NET,IB}/mlx4: Add rmap support to mlx4_assign_eq
Enable callers of mlx4_assign_eq to supply a pointer to cpu_rmap.
If supplied, the assigned IRQ is tracked using rmap infrastructure.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-19 08:34:37 -07:00
Yevgeny Petrilin e22979d96a mlx4_en: Moving to Interrupts for TX completions
Moving to interrupts instead of polling fpr TX completions
Avoiding situations where skb can be held in by the driver for
a long time (till timer expires).
The change is also necessary for supporting BQL.

Removing comp_lock that was required because we could handle TX
completions from several contexts: Interrupts, timer, polling.
Now there is only interrupts

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-23 22:34:02 -04:00
David S. Miller 455ffa607f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-01-02 18:56:49 -05:00
Yevgeny Petrilin cd3109d23c mlx4_en: nullify cq->vector field when closing completion queue
Caused loss of connectivity when changing ring size.

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-30 17:15:41 -05:00
Yevgeny Petrilin f0ab34f011 net/mlx4_en: using non collapsed CQ on TX
Moving to regular Completion Queue implementation (not collapsed)
Completion for each transmitted packet is written to new entry.

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-27 17:17:04 -05:00
Alexander Guller fe0af03c69 mlx4_en: Removing reserve vectors
Fixed a bug where ring size change caused insufficient memory
upon driver restart due to unreleased EQs.

Signed-off-by: Alexander Guller <alexg@mellanox.co.il>
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-09 23:42:57 -04:00
Alexander Guller 76532d0c7e mlx4_en: Assigning TX irq per ring
Until now only RX rings used irq per ring
and TX used only one per port.
>From now on, both of them will use the
irq per ring while RX & TX ring[i] will
use the same irq.

Signed-off-by: Alexander Guller <alexg@mellanox.co.il>
Signed-off-by: Sharon Cohen <sharonc@mellanox.co.il>
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-09 23:42:56 -04:00
Jeff Kirsher 5a2cc190eb mlx4: Move the Mellanox driver
Moves the Mellanox driver into drivers/net/ethernet/mellanox/ and
make the necessary Kconfig and Makefile changes.

CC: Roland Dreier <roland@kernel.org>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2011-08-11 02:41:35 -07:00