In order to support BPF-to-BPF calls in offloaded programs, the nfp
driver must collect information about the distinct subprograms: namely,
the number of subprograms composing the complete program and the stack
depth of those subprograms. The latter in particular is non-trivial to
collect, so we copy those elements from the kernel verifier via the
newly added post-verification hook. The struct nfp_prog is extended to
store this information. Stack depths are stored in an array of dedicated
structs.
Subprogram start indexes are not collected. Instead, meta instructions
associated to the start of a subprogram will be marked with a flag in a
later patch.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In preparation for support for BPF to BPF calls in offloaded programs,
rename the "stack_depth" field of the struct nfp_prog as
"stack_frame_depth". This is to make it clear that the field refers to
the maximum size of the current stack frame (as opposed to the maximum
size of the whole stack memory).
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In preparation for BPF-to-BPF calls in offloaded programs, add a new
function attribute to the struct bpf_prog_offload_ops so that drivers
supporting eBPF offload can hook at the end of program verification, and
potentially extract information collected by the verifier.
Implement a minimal callback (returning 0) in the drivers providing the
structs, namely netdevsim and nfp.
This will be useful in the nfp driver, in later commits, to extract the
number of subprograms as well as the stack depth for those subprograms.
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
mlx5 core driver and ethernet netdev updates, please note there is a small
devlink releated update to allow extack argument to eswitch operations.
From Eli Britstein,
1) devlink: Add extack argument to the eswitch related operations
2) net/mlx5e: E-Switch, return extack messages for failures in the e-switch devlink callbacks
3) net/mlx5e: Add extack messages for TC offload failures
From Eran Ben Elisha,
4) mlx5e: Add counter for aRFS rule insertion failures
From Feras Daoud
5) Fast teardown support for mlx5 device
This change introduces the enhanced version of the "Force teardown" that
allows SW to perform teardown in a faster way without the need to reclaim
all the FW pages.
Fast teardown provides the following advantages:
1- Fix a FW race condition that could cause command timeout
2- Avoid moving to polling mode
3- Close the vport to prevent PCI ACK to be sent without been scatter
to memory
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJbtU45AAoJEEg/ir3gV/o+/C4H/RHA4KImrb476EdB3VNYMqAN
dgXb+bmh6sZP+jHWqQ4c3aVeh6/T8qm4gwiSn2nVTtHEnxtCdIYljzDC1Nswczeg
pSjD1eOP7M1LpAOmBb8xdnJcX7yM7r1bTklnp2sN853WShbsDRYgZBHsBwTzx25U
ZdzL4QTLuohlG/aLrbGXMntIy45ya2fVQrnK54s18nFlgsdFjEs0mi0xaUKNBC6+
P8CTohHAxuuxmL5b+6MIYLZCdgd8cLNQFdtqbckEVw7SvcRTxfraRlyqJ0YOgTGB
TdSWnqZz2JYH29wSFbpFG8qX6GCv8FoiZ+fKzldbolHk442rrktHv3+Y7qQuZVs=
=NVks
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2018-10-03' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2018-10-03
mlx5 core driver and ethernet netdev updates, please note there is a small
devlink releated update to allow extack argument to eswitch operations.
From Eli Britstein,
1) devlink: Add extack argument to the eswitch related operations
2) net/mlx5e: E-Switch, return extack messages for failures in the e-switch devlink callbacks
3) net/mlx5e: Add extack messages for TC offload failures
From Eran Ben Elisha,
4) mlx5e: Add counter for aRFS rule insertion failures
From Feras Daoud
5) Fast teardown support for mlx5 device
This change introduces the enhanced version of the "Force teardown" that
allows SW to perform teardown in a faster way without the need to reclaim
all the FW pages.
Fast teardown provides the following advantages:
1- Fix a FW race condition that could cause command timeout
2- Avoid moving to polling mode
3- Close the vport to prevent PCI ACK to be sent without been scatter
to memory
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Minor conflict in net/core/rtnetlink.c, David Ahern's bug fix in 'net'
overlapped the renaming of a netlink attribute in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add extack argument to the eswitch related operations.
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
When FW floods the driver with control messages try to exit the cmsg
processing loop every now and then to avoid soft lockups. Cmsg
processing is generally very lightweight so 512 seems like a reasonable
budget, which should not be exceeded under normal conditions.
Fixes: 77ece8d5f1 ("nfp: add control vNIC datapath")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Tested-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In current ABI the size of the messages carrying map elements was
statically defined to at most 16 words of key and 16 words of value
(NFP word is 4 bytes). We should not make this assumption and use
the max key and value sizes from the BPF capability instead.
To make sure old kernels don't get surprised with larger (or smaller)
messages bump the FW ABI version to 3 when key/value size is different
than 16 words.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Some apps may want to have higher MTU on the control vNIC/queue.
Allow them to set the requested MTU at init time.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Up until now we only had per-vNIC BPF ABI version capabilities,
which are slightly awkward to use because bulk of the resources
and configuration does not relate to any particular vNIC. Add
a new capability for global ABI version and check the per-vNIC
version are equal to it. Assume the ABI version 2 if no explicit
version capability is present.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reserve two TLV types for feature development, and warn in the driver
if they ever leak into production.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Version bump conflict in batman-adv, take what's in net-next.
iavf conflict, adjustment of netdev_ops in net-next conflicting
with poll controller method removal in net.
Signed-off-by: David S. Miller <davem@davemloft.net>
As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.
nfp uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Tested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NFP supports fairly enormous ring sizes (up to 256k descriptors).
In commit 4662717038 ("nfp: use kvcalloc() to allocate SW buffer
descriptor arrays") we have started using kvcalloc() functions to
make sure the allocation of software state arrays doesn't hit
the MAX_ORDER limit. Unfortunately, we can't use virtual mappings
for the DMA region holding HW descriptors. In case this allocation
fails instead of the generic (and fairly scary) warning/splat in
the logs print a helpful message explaining what happened and
suggesting how to fix it.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Report in standard netdev stats drops and errors as well as
RX multicast from the FW vNIC counters.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes a bug where ipv6 tunnels would report that it is
getting offloaded to hardware but would actually be rejected
by hardware.
Fixes: b27d6a95a7 ("nfp: compile flower vxlan tunnel set actions")
Signed-off-by: Louis Peens <louis.peens@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously we only checked if the vlan id field is present when trying
to match a vlan tag. The vlan id and vlan pcp field should be treated
independently.
Fixes: 5571e8c9f2 ("nfp: extend flower matching capabilities")
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As you are already in a tasklet, it is unnecessary to call spin_lock_bh.
Signed-off-by: jun qian <hangdianqj@163.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
VXLAN and GRE FW features have to currently be both advertised
for the driver to enable them. Separate the handling.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With the accesses to rtsyms now all going via special helpers
we can easily make sure the driver is not reading past the
end of the symbol.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Francois H. Theron <francois.theron@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For ease of debug preface all error messages with the name
of the symbol which caused them. Use the same message format
for existing messages while at it.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Francois H. Theron <francois.theron@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Return the error and report value through the output param.
Fixes: 640917dd81 ("nfp: support access to absolute RTsyms")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Francois H. Theron <francois.theron@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To avoid leaking a running timer we need to wait for the
posted reconfigs after netdev is unregistered. In common
case the process of deinitializing the device will perform
synchronous reconfigs which wait for posted requests, but
especially with VXLAN ports being actively added and removed
there can be a race condition leaving a timer running after
adapter structure is freed leading to a crash.
Add an explicit flush after deregistering and for a good
measure a warning to check if timer is running just before
structures are freed.
Fixes: 3d780b926a ("nfp: add async reconfiguration mechanism")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the RTsym users access the size via the helper, which
takes care of special handling of absolute symbols.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Francois H. Theron <francois.theron@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support in nfpcore for reading the absolute RTsyms.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Francois H. Theron <francois.theron@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert all users of RTsym to the new set of helpers which
handle all targets correctly.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Francois H. Theron <francois.theron@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make nfp_rtsym_{read,write}_le() and nfp_rtsym_map() use the new
target resolution helpers to allow accessing in-cache symbols.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Francois H. Theron <francois.theron@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Align nfp_cpp_map_area() with other CPP-level APIs and pass
encoded cpp_id/dest rather than target, action, domain tuple.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Francois H. Theron <francois.theron@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RTsyms may have special encodings for more complex symbol types.
For example symbols which are placed in external memory unit's
cache directly, constants or local memory. Add set of helpers
which will check for those special encodings and handle them
correctly.
For now only add direct cache accesses, we don't have a need to
access the other ones in foreseeable future.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Francois H. Theron <francois.theron@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add error prints to CPP target encoding/decoding logic, otherwise
it's quite hard to pin point the reasons why read or write
operations fail.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Francois H. Theron <francois.theron@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We will soon need the MU locality field offset much more
often than just for decoding MIP address. Save it in nfp_cpp
for quick access. Note that we can already reuse the target
config from nfp_cpp, no need to do the XPB read.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Francois H. Theron <francois.theron@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use a switch statement instead of ifs for code dependent
on chip version. While at it make sure we fail for unknown
chip revisions.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add NFP5000 to supported chips, the chip is backward compatible
with NFP4000 and NFP6000, so core PCIe code needs to handle it
the same way as 4k and 6k.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In multi-host scenarios Management FW may allocate MAC addresses
at runtime, we have to use the indirect lookup to find them.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Management FW can adjust some of the information in the HWinfo table
at runtime. In some cases reading the table directly will not yield
correct results. Add a NSP command for looking up information.
Up until now we weren't making use of any of the values which may
get adjusted.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To enable easier FW distribution NFP can now automatically
select between FW stored on the flash and loaded from the
kernel.
If FW loading policy is set to auto it will compare the
versions of FW from the host and from the flash and load
the newer one. If FW type doesn't match (e.g. one advanced
application vs another) the FW from the host takes precedence,
unless one of them is the basic NIC firmware, in which case
the non-basic-NIC FW is selected.
This automatic selection mechanism requires we inform user
what the verdict was. Print a message to the logs explaining
the decision and the reason.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Flash may contain a default NFP application FW. This application
can either be put there by the user (with ethtool -f) or shipped
with the card. If file system FW is not found, attempt to load
this flash stored app FW.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is already a fair number of arguments to nfp_nsp_command()
family of functions. Encapsulate them into structures to make
adding new ones easier. No functional changes.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 90b73b77d0, list_head is no longer needed.
Now we just need to convert the list iteration to array
iteration for drivers.
Fixes: 90b73b77d0 ("net: sched: change action API to use array of pointers to actions")
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove 'Return:' information from functions which no longer
return a value. Also update name and return types of nfp_nffw_info
access functions.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a new layer for matching on geneve options. This allows
offloading filters configured to match geneve with options.
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce new push geneve option action. This allows offloading
filters configured to entunnel geneve with options.
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The addition of FLOW_DISSECTOR_KEY_ENC_IP to TC flower means that the ToS
and TTL of the tunnel header can now be matched on.
Extend the NFP tunnel match function to include these new fields.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The TTL for encapsulating headers in IPv4 UDP tunnels is taken from a
route lookup. Modify this to first check if a user has specified a TTL to
be used in the TC action.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-08-07
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add cgroup local storage for BPF programs, which provides a fast
accessible memory for storing various per-cgroup data like number
of transmitted packets, etc, from Roman.
2) Support bpf_get_socket_cookie() BPF helper in several more program
types that have a full socket available, from Andrey.
3) Significantly improve the performance of perf events which are
reported from BPF offload. Also convert a couple of BPF AF_XDP
samples overto use libbpf, both from Jakub.
4) seg6local LWT provides the End.DT6 action, which allows to
decapsulate an outer IPv6 header containing a Segment Routing Header.
Adds this action now to the seg6local BPF interface, from Mathieu.
5) Do not mark dst register as unbounded in MOV64 instruction when
both src and dst register are the same, from Arthur.
6) Define u_smp_rmb() and u_smp_wmb() to their respective barrier
instructions on arm64 for the AF_XDP sample code, from Brian.
7) Convert the tcp_client.py and tcp_server.py BPF selftest scripts
over from Python 2 to Python 3, from Jeremy.
8) Enable BTF build flags to the BPF sample code Makefile, from Taeung.
9) Remove an unnecessary rcu_read_lock() in run_lwt_bpf(), from Taehee.
10) Several improvements to the README.rst from the BPF documentation
to make it more consistent with RST format, from Tobin.
11) Replace all occurrences of strerror() by calls to strerror_r()
in libbpf and fix a FORTIFY_SOURCE build error along with it,
from Thomas.
12) Fix a bug in bpftool's get_btf() function to correctly propagate
an error via PTR_ERR(), from Yue.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for adjust_tail. There are no FW changes needed but add
a FW capability just in case there would be any issue with previously
released FW, or we will have to change the ABI in the future.
The helper is trivial and shouldn't be used too often so just inline
the body of the function. We add the delta to locally maintained
packet length register and check for overflow, since add of negative
value must overflow if result is positive. Note that if delta of 0
would be allowed in the kernel this trick stops working and we need
one more instruction to compare lengths before and after the change.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The BTF conflicts were simple overlapping changes.
The virtio_net conflict was an overlap of a fix of statistics counter,
happening alongisde a move over to a bonafide statistics structure
rather than counting value on the stack.
Signed-off-by: David S. Miller <davem@davemloft.net>
'app' is dereferenced before used for the devlink trace point.
In case FW is buggy and sends a control message to a VF queue
we should make sure app is not NULL.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Function nfp_flower_repr_get_type_and_port expects an enum nfp_repr_type
return value but, if the repr type is unknown, returns a value of type
enum nfp_flower_cmsg_port_type. This means that if FW encodes the port
ID in a way the driver does not understand instead of dropping the frame
driver may attribute it to a physical port (uplink) provided the port
number is less than physical port count.
Fix this and ensure a net_device of NULL is returned if the repr can not
be determined.
Fixes: 1025351a88 ("nfp: add flower app")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
FW can put constraints on map element size to maximize resource
use and efficiency. When user attempts offload of a map which
does not fit into those constraints an informational message is
printed to kernel logs to inform user about the reason offload
failed. Map offload does not have access to any advanced error
reporting like verifier log or extack. There is also currently
no way for us to nicely expose the FW capabilities to user
space. Given all those constraints we should make sure log
messages are as informative as possible. Improve them.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Record perf maps by map ID, not raw kernel pointer. This helps
with debug messages, because printing pointers to logs is frowned
upon, and makes debug easier for the users, as map ID is something
they should be more familiar with. Note that perf maps are offload
neutral, therefore IDs won't be orphaned.
While at it use a rate limited print helper for the error message.
Reported-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Control queue is fairly low latency, and requires SKB allocations,
which means we can't even reach 0.5Msps with perf events. Allow
perf events to be delivered to data queues. This allows us to not
only use multiple queues, but also receive and deliver to user space
more than 5Msps per queue (Xeon E5-2630 v4 2.20GHz, no retpolines).
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
In preparation for SKB-less perf event handling make
nfp_bpf_event_output() take buffer address and length,
not SKB as parameters.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Port id 0xffffffff is reserved for control messages. Allow reception
of messages with this id on data queues. Hand off a raw buffer to
the higher layer code, without allocating SKB for max efficiency.
The RX handle can't modify or keep the buffer, after it returns
buffer is handed back over to the NIC RX free buffer list.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Representor packets are received on PF queues with special metadata tag
for demux. There is no reason to resolve the representor ID -> netdev
after the skb has been allocated. Move the code, this will allow us to
handle special FW messages without SKB allocation overhead.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Use array_size() and store the size as full size_t to protect from
theoretical size overflow when handling HW descriptor rings.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 7f1c684a89 ("nfp: setup xdp_rxq_info") mixed the cache
cold and cache hot data in the nfp_net_rx_ring structure (ignoring
the feedback), to try to fit the structure into 2 cache lines
after struct xdp_rxq_info was added. Now that we are about to add
a new field the structure will grow back to 3 cache lines, so
order the members correctly.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use kvcalloc() instead of tmp variable + kzalloc() when allocating
SW buffer information to allow falling back to vmalloc and to protect
from theoretical integer overflow.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On machines with buggy ACPI tables or when SR-IOV is already enabled
we may not be able to set the SR-IOV VF limit in sysfs, it's not fatal
because the limit is imposed by the driver anyway. Only the sysfs
'sriov_totalvfs' attribute will be too high. Print an error to inform
user about the failure but allow probe to continue.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After device is stopped we reset the rings by moving all free buffers
to positions [0, cnt - 2], and clear the position cnt - 1 in the ring.
We then proceed to clear the read/write pointers. This means that if
we try to reset the ring again the code will assume that the next to
fill buffer is at position 0 and swap it with cnt - 1. Since we
previously cleared position cnt - 1 it will lead to leaking the first
buffer and leaving ring in a bad state.
This scenario can only happen if FW communication fails, in which case
the ring will never be used again, so the fact it's in a bad state will
not be noticed. Buffer leak is the only problem. Don't try to move
buffers in the ring if the read/write pointers indicate the ring was
never used or have already been reset.
nfp_net_clear_config_and_disable() is now fully idempotent.
Found by code inspection, FW communication failures are very rare,
and reconfiguring a live device is not common either, so it's unlikely
anyone has ever noticed the leak.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we have offload replay infrastructure added by
commit 326367427c ("net: sched: call reoffload op on block callback reg")
and flows are guaranteed to be removed correctly, we can revert
commit 951a8ee6de ("nfp: reject binding to shared blocks").
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously only the neighbour state was checked to decide if an offloaded
entry should be removed. However, there can be situations when the entry
is dead but still marked as valid. This can lead to dead entries not
being removed from fw tables or even incorrect data being added.
Check the entry dead bit before deciding if it should be added to or
removed from fw neighbour tables.
Fixes: 8e6a9046b6 ("nfp: flower vxlan neighbour offload")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-07-20
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add sharing of BPF objects within one ASIC: this allows for reuse of
the same program on multiple ports of a device, and therefore gains
better code store utilization. On top of that, this now also enables
sharing of maps between programs attached to different ports of a
device, from Jakub.
2) Cleanup in libbpf and bpftool's Makefile to reduce unneeded feature
detections and unused variable exports, also from Jakub.
3) First batch of RCU annotation fixes in prog array handling, i.e.
there are several __rcu markers which are not correct as well as
some of the RCU handling, from Roman.
4) Two fixes in BPF sample files related to checking of the prog_cnt
upper limit from sample loader, from Dan.
5) Minor cleanup in sockmap to remove a set but not used variable,
from Colin.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow program sharing between netdevs of the same NFP ASIC.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Create a higher-level entity to represent a device/ASIC to allow
programs and maps to be shared between device ports. The extra
work is required to make sure we don't destroy BPF objects as
soon as the netdev for which they were loaded gets destroyed,
as other ports may still be using them. When netdev goes away
all of its BPF objects will be moved to other netdevs of the
device, and only destroyed when last netdev is unregistered.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently we have two lists of offloaded objects - programs and maps.
Netdevice unregister notifier scans those lists to orphan objects
associated with device being unregistered. This puts unnecessary
(even if negligible) burden on all netdev unregister calls in BPF-
-enabled kernel. The lists of objects may potentially get long
making the linear scan even more problematic. There haven't been
complaints about this mechanisms so far, but it is suboptimal.
Instead of relying on notifiers, make the few BPF-capable drivers
register explicitly for BPF offloads. The programs and maps will
now be collected per-device not on a global list, and only scanned
for removal when driver unregisters from BPF offloads.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
BPF code should unregister the offload capabilities from .ndo_uninit(),
to make sure the operation is atomic with unlist_netdevice(). Plumb
the init/uninit NDOs for vNICs and representors.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-07-15
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Various different arm32 JIT improvements in order to optimize code emission
and make the JIT code itself more robust, from Russell.
2) Support simultaneous driver and offloaded XDP in order to allow for advanced
use-cases where some work is offloaded to the NIC and some to the host. Also
add ability for bpftool to load programs and maps beyond just the cgroup case,
from Jakub.
3) Add BPF JIT support in nfp for multiplication as well as division. For the
latter in particular, it uses the reciprocal algorithm to emulate it, from Jiong.
4) Add BTF pretty print functionality to bpftool in plain and JSON output
format, from Okash.
5) Add build and installation to the BPF helper man page into bpftool, from Quentin.
6) Add a TCP BPF callback for listening sockets which is triggered right after
the socket transitions to TCP_LISTEN state, from Andrey.
7) Add a new cgroup tree command to bpftool which iterates over the whole cgroup
tree and prints all attached programs, from Roman.
8) Improve xdp_redirect_cpu sample to support parsing of double VLAN tagged
packets, from Jesper.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Split handling of offloaded and driver programs completely. Since
offloaded programs always come with XDP_FLAGS_HW_MODE set in reality
there could be no sharing, anyway, programs would only be installed
in driver or in hardware. Splitting the handling allows us to install
programs in HW and in driver at the same time.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Split the query of HW-attached program from the software one.
Introduce new .ndo_bpf command to query HW-attached program.
This will allow drivers to install different programs in HW
and SW at the same time. Netlink can now also carry multiple
programs on dump (in which case mode will be set to
XDP_ATTACHED_MULTI and user has to check per-attachment point
attributes, IFLA_XDP_PROG_ID will not be present). We reuse
IFLA_XDP_PROG_ID skb space for second mode, so rtnl_xdp_size()
doesn't need to be updated.
Note that the installation side is still not there, since all
drivers currently reject installing more than one program at
the time.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Basic operations drivers perform during xdp setup and query can
be moved to helpers in the core. Encapsulate program and flags
into a structure and add helpers. Note that the structure is
intended as the "main" program information source in the driver.
Most drivers will additionally place the program pointer in their
fast path or ring structures.
The helpers don't have a huge impact now, but they will
decrease the code duplication when programs can be installed
in HW and driver at the same time. Encapsulating the basic
operations in helpers will hopefully also reduce the number
of changes to drivers which adopt them.
Helpers could really be static inline, but they depend on
definition of struct netdev_bpf which means they'd have
to be placed in netdevice.h, an already 4500 line header.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
prog_attached of struct netdev_bpf should have been superseded
by simply setting prog_id long time ago, but we kept it around
to allow offloading drivers to communicate attachment mode (drv
vs hw). Subsequently drivers were also allowed to report back
attachment flags (prog_flags), and since nowadays only programs
attached will XDP_FLAGS_HW_MODE can get offloaded, we can tell
the attachment mode from the flags driver reports. Remove
prog_attached member.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
getnstimeofday64 is deprecated in favor of the ktime_get() family of
functions. The direct replacement would be ktime_get_real_ts64(),
but I'm picking the basic ktime_get() instead:
- using a ktime_t simplifies the code compared to timespec64
- using monotonic time instead of real time avoids issues caused
by a concurrent settimeofday() or during a leap second adjustment.
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAltBPacUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vwx+hAAvzTP6o3VOtgNK7lm3nOBuzfykgCv
TFhXP2yeDItWDBLDpWX7wjs2657W3Sjrpw6FyVIYvsoMKKRuOYeZ6ChDieG5ZgTj
oxav5U6TlDHoF0fNq0LWfv78lP6+++7/6yaer6j9xDksVqE4/zxlFcExxuszhZlC
8ptJ54ORn92RdfRHCDptA4PFReNlQWNw3bpKpGxu8xj0TN/sYPN3ggHfnmiEyGMZ
8/KLNOzGrhoqztaBPyRoG3GhU1hpicT+fWzg11Li8hP1solQDht+S3mnCC1IxqdI
JSU7/jhti4nUV56qE7QzYzdsVbTFcItGn0WImklIFCt7htp4Rms0wN+QCmwhtjKM
0WH02gRDip4CcYcPZGeGaeexWJFEScamFK1yxxDV7059KsoQKUN1Sm+R7y9xORYw
nQAHPTO/nv02Xo+ADAYzV0aBPD7fEvFaWtdXAuLocVWVj3eiEqp8ftoYmuWym6T3
gHWt9Dod8olj3t9bkR1MWZ121Ar5MsobgJSF30smEx8VMKv+4Xx8eXvq0FCuUQwT
s/WLTPqK31Sqjii0xe1wO8g0yexCVaBpXJB/e9PpYf/+ICItG1DHLz0Ygw01QWVK
Tv7HTAofdO42kChHjErB6GbzSuxDxSVCPN26bwkZu0eV91uKiLKA7wLUv8+qSXlZ
lK98Eypy0Ti7pvA=
=IOF/
-----END PGP SIGNATURE-----
Merge tag 'pci-v4.18-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI fixes from Bjorn Helgaas:
- Fix a use-after-free in the endpoint code (Dan Carpenter)
- Stop defaulting CONFIG_PCIE_DW_PLAT_HOST to yes (Geert Uytterhoeven)
- Fix an nfp regression caused by a change in how we limit the number
of VFs we can enable (Jakub Kicinski)
- Fix failure path cleanup issues in the new R-Car gen3 PHY support
(Marek Vasut)
- Fix leaks of OF nodes in faraday, xilinx-nwl, xilinx (Nicholas Mc
Guire)
* tag 'pci-v4.18-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
nfp: stop limiting VFs to 0
PCI/IOV: Reset total_VFs limit after detaching PF driver
PCI: faraday: Add missing of_node_put()
PCI: xilinx-nwl: Add missing of_node_put()
PCI: xilinx: Add missing of_node_put()
PCI: endpoint: Use after free in pci_epf_unregister_driver()
PCI: controller: dwc: Do not let PCIE_DW_PLAT_HOST default to yes
PCI: rcar: Clean up PHY init on failure
PCI: rcar: Shut the PHY down in failpath
As we are doing JIT, we would want to use the advanced version of the
reciprocal divide (reciprocal_value_adv) to trade performance with host.
We could reduce the required ALU instructions from 4 to 2 or 1.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
NFP doesn't have integer divide instruction, this patch use reciprocal
algorithm (the basic one, reciprocal_div) to emulate it.
For each u32 divide, we would need 11 instructions to finish the operation.
7 (for multiplication) + 4 (various ALUs) = 11
Given NFP only supports multiplication no bigger than u32, we'd require
divisor and dividend no bigger than that as well.
Also eBPF doesn't support signed divide and has enforced this on C language
level by failing compilation. However LLVM assembler hasn't enforced this,
so it is possible for negative constant to leak in as a BPF_K operand
through assembly code, we reject such cases as well.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
NFP supports u16 and u32 multiplication. Multiplication is done 8-bits per
step, therefore we need 2 steps for u16 and 4 steps for u32.
We also need one start instruction to initialize the sequence and one or
two instructions to fetch the result depending on either you need the high
halve of u32 multiplication.
For ALU64, if either operand is beyond u32's value range, we reject it. One
thing to note, if the source operand is BPF_K, then we need to check "imm"
field directly, and we'd reject it if it is negative. Because for ALU64,
"imm" (with s32 type) is expected to be sign extended to s64 which NFP mul
doesn't support. For ALU32, it is fine for "imm" be negative though,
because the result is 32-bits and here is no difference on the low halve
of result for signed/unsigned mul, so we will get correct result.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
NFP verifier hook is coping range information of the shift amount for
indirect shift operation so optimized shift sequences could be generated.
We want to use range info to do more things. For example, to decide whether
multiplication and divide are supported on the given range.
This patch simply let NFP verifier hook to copy range info for all operands
of all ALU operands.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The two fields are a copy of umin and umax info of bpf_insn->src_reg
generated by verifier.
Rename to make their meaning clear.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-07-03
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Various improvements to bpftool and libbpf, that is, bpftool build
speed improvements, missing BPF program types added for detection
by section name, ability to load programs from '.text' section is
made to work again, and better bash completion handling, from Jakub.
2) Improvements to nfp JIT's map read handling which allows for optimizing
memcpy from map to packet, from Jiong.
3) New BPF sample is added which demonstrates XDP in combination with
bpf_perf_event_output() helper to sample packets on all CPUs, from Toke.
4) Add a new BPF kselftest case for tracking connect(2) BPF hooks
infrastructure in combination with TFO, from Andrey.
5) Extend the XDP/BPF xdp_rxq_info sample code with a cmdline option to
read payload from packet data in order to use it for benchmarking.
Also for '--action XDP_TX' option implement swapping of MAC addresses
to avoid drops on some hardware seen during testing, from Jesper.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Simple overlapping changes in stmmac driver.
Adjust skb_gro_flush_final_remcsum function signature to make GRO list
changes in net-next, as per Stephen Rothwell's example merge
resolution.
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2018-07-01
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) A bpf_fib_lookup() helper fix to change the API before freeze to
return an encoding of the FIB lookup result and return the nexthop
device index in the params struct (instead of device index as return
code that we had before), from David.
2) Various BPF JIT fixes to address syzkaller fallout, that is, do not
reject progs when set_memory_*() fails since it could still be RO.
Also arm32 JIT was not using bpf_jit_binary_lock_ro() API which was
an issue, and a memory leak in s390 JIT found during review, from
Daniel.
3) Multiple fixes for sockmap/hash to address most of the syzkaller
triggered bugs. Usage with IPv6 was crashing, a GPF in bpf_tcp_close(),
a missing sock_map_release() routine to hook up to callbacks, and a
fix for an omitted bucket lock in sock_close(), from John.
4) Two bpftool fixes to remove duplicated error message on program load,
and another one to close the libbpf object after program load. One
additional fix for nfp driver's BPF offload to avoid stopping offload
completely if replace of program failed, from Jakub.
5) Couple of BPF selftest fixes that bail out in some of the test
scripts if the user does not have the right privileges, from Jeffrin.
6) Fixes in test_bpf for s390 when CONFIG_BPF_JIT_ALWAYS_ON is set
where we need to set the flag that some of the test cases are expected
to fail, from Kleber.
7) Fix to detangle BPF_LIRC_MODE2 dependency from CONFIG_CGROUP_BPF
since it has no relation to it and lirc2 users often have configs
without cgroups enabled and thus would not be able to use it, from Sean.
8) Fix a selftest failure in sockmap by removing a useless setrlimit()
call that would set a too low limit where at the same time we are
already including bpf_rlimit.h that does the job, from Yonghong.
9) Fix BPF selftest config with missing missing NET_SCHED, from Anders.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the NFP fw only supports L3/L4 hashing so rejects the offload of
filters that output to LAG ports implementing other hash algorithms. Team,
however, uses a BPF function for the hash that is not defined. To support
Team offload, accept hashes that are defined as 'unknown' (only Team
defines such hash types). In this case, use the NFP default of L3/L4
hashing for egress port selection.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extract the tos and the tunnel flags from the tunnel key and offload these
action fields. Only the checksum and tunnel key flags are implemented in
fw so reject offloads of other flags. The tunnel key flag is always
considered set in the fw so enforce that it is set in the rule. Note that
the compulsory setting of the tunnel key flag and optional setting of
checksum is inline with how tc currently generates ipv4 udp tunnel
actions.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously the ttl for ipv4 udp tunnels was set to the namespace default.
Modify this to attempt to extract the ttl from a full route lookup on the
tunnel destination. If this is not possible then resort to the default.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hardware will automatically update csum in headers when a set action has
been performed. This means we could in the driver ignore the explicit
checksum action when performing a set action.
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We used to leave bus-info in ethtool driver info empty for
representors in case multi-PCIe-to-single-host cards make
the association between PCIe device and NFP many to one.
It seems these attempts are futile, we need to link the
representors to one PCIe device in sysfs to get consistent
naming, plus devlink uses one PCIe as a handle, anyway.
The multi-PCIe-to-single-system support won't be clean,
if it ever comes.
Turns out some user space (RHEL tests) likes to read bus-info
so just populate it.
While at it remove unnecessary app NULL-check, representors
are spawned by an app, so it must exist.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use napi_consume_skb() in nfp_net_tx_complete() to get bulk free.
Pass 0 as budget for ctrl queue completion since it runs out of
a tasklet.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NFP NAPI handling will only complete the TXed packets when called
with budget of 0, implement ndo_poll_controller by scheduling NAPI
on all TX queues.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On some platforms with broken ACPI tables we may not have access
to the Serial Number PCIe capability. This capability is crucial
for us for switchdev operation as we use serial number as switch ID,
and for communication with management FW where interface ID is used.
If we can't determine the Serial Number we have to fail device probe.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After user changes the ring count statistics for deactivated
rings disappear from ethtool -S output. This causes loss of
information to the user and means that ethtool stats may not
add up to interface stats. Always expose counters from all
the rings. Note that we allocate at most num_possible_cpus()
rings so number of rings should be reasonable.
The alternative of only listing stats for rings which were
ever in use could be confusing.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before 8d85a7a4f2 ("PCI/IOV: Allow PF drivers to limit total_VFs to 0"),
pci_sriov_set_totalvfs(pdev, 0) meant "we can enable TotalVFs virtual
functions". After 8d85a7a4f2, it means "we can't enable *any* VFs".
That broke this scenario where nfp intends to remove any limit on the
number of VFs that can be enabled:
nfp_pci_probe
nfp_pcie_sriov_read_nfd_limit
nfp_rtsym_read_le("nfd_vf_cfg_max_vfs", &err)
pci_sriov_set_totalvfs(pf->pdev, 0) # if FW didn't expose a limit
...
# userspace writes N to sysfs "sriov_numvfs":
sriov_numvfs_store
pci_sriov_get_totalvfs # now returns 0
return -ERANGE
Prior to 8d85a7a4f2, pci_sriov_get_totalvfs() returned TotalVFs, but it
now returns 0.
Remove the pci_sriov_set_totalvfs(pdev, 0) calls so we don't limit the
number of VFs that can be enabled.
Fixes: 8d85a7a4f2 ("PCI/IOV: Allow PF drivers to limit total_VFs to 0")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
[bhelgaas: changelog]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Map read has been supported on NFP, this patch enables optimization
for memcpy from map to packet.
This patch also fixed one latent bug which will cause copying from
unexpected address once memcpy for map pointer enabled. The fixed
code path was not exercised before.
Reported-by: Mary Pham <mary.pham@netronome.com>
Reported-by: David Beckett <david.beckett@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
sizeof() will return unsigned value so in the error check
negative error code will be always larger than sizeof().
Fixes: a0d8e02c35 ("nfp: add support for reading nffw info")
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TC shared blocks allow multiple qdiscs to be grouped together and filters
shared between them. Currently the chains of filters attached to a block
are only flushed when the block is removed. If a qdisc is removed from a
block but the block still exists, flow del messages are not passed to the
callback registered for that qdisc. For the NFP, this presents the
possibility of rules still existing in hw when they should be removed.
Prevent binding to shared blocks until the kernel can send per qdisc del
messages when block unbinds occur.
tcf_block_shared() was not used outside of the core until now, so also
add an empty implementation for builds with CONFIG_NET_CLS=n.
Fixes: 4861738775 ("net: sched: introduce shared filter blocks infrastructure")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously it was not possible to distinguish between mpls ether types and
other ether types. This leads to incorrect classification of offloaded
filters that match on mpls ether type. For example the following two
filters overlap:
# tc filter add dev eth0 parent ffff: \
protocol 0x8847 flower \
action mirred egress redirect dev eth1
# tc filter add dev eth0 parent ffff: \
protocol 0x0800 flower \
action mirred egress redirect dev eth2
The driver now correctly includes the mac_mpls layer where HW stores mpls
fields, when it detects an mpls ether type. It also sets the MPLS_Q bit to
indicate that the filter should match mpls packets.
Fixes: bb055c198d ("nfp: add mpls match offloading support")
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass the extact struct from a tc qdisc add to the block bind function and,
in turn, to the setup_tc ndo of binding device via the tc_block_offload
struct. Pass this back to any block callback registrations to allow
netlink logging of fails in the bind process.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stopping offload completely if replace of program failed dates
back to days of transparent offload. Back then we wanted to
silently fall back to the in-driver processing. Today we mark
programs for offload when they are loaded into the kernel, so
the transparent offload is no longer a reality.
Flags check in the driver will only allow replace of a driver
program with another driver program or an offload program with
another offload program.
When driver program is replaced stopping offload is a no-op,
because driver program isn't offloaded. When replacing
offloaded program if the offload fails the entire operation
will fail all the way back to user space and we should continue
using the old program. IOW when replacing a driver program
stopping offload is unnecessary and when replacing offloaded
program - it's a bug, old program should continue to run.
In practice this bug would mean that if offload operation was to
fail (either due to FW communication error, kernel OOM or new
program being offloaded but for a different netdev) driver
would continue reporting that previous XDP program is offloaded
but in fact no program will be loaded in hardware. The failure
is fairly unlikely (found by inspection, when working on the code)
but it's unpleasant.
Backport note: even though the bug was introduced in commit
cafa92ac25 ("nfp: bpf: add support for XDP_FLAGS_HW_MODE"),
this fix depends on commit 441a33031f ("net: xdp: don't allow
device-bound programs in driver mode"), so this fix is sufficient
only in v4.15 or newer. Kernels v4.13.x and v4.14.x do need to
stop offload if it was transparent/opportunistic, i.e. if
XDP_FLAGS_HW_MODE was not set on running program.
Fixes: cafa92ac25 ("nfp: bpf: add support for XDP_FLAGS_HW_MODE")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently the default case is not handled, which with future command
introductions would introduce a warning. So handle it.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Various netfilter fixlets from Pablo and the netfilter team.
2) Fix regression in IPVS caused by lack of PMTU exceptions on local
routes in ipv6, from Julian Anastasov.
3) Check pskb_trim_rcsum for failure in DSA, from Zhouyang Jia.
4) Don't crash on poll in TLS, from Daniel Borkmann.
5) Revert SO_REUSE{ADDR,PORT} change, it regresses various things
including Avahi mDNS. From Bart Van Assche.
6) Missing of_node_put in qcom/emac driver, from Yue Haibing.
7) We lack checking of the TCP checking in one special case during SYN
receive, from Frank van der Linden.
8) Fix module init error paths of mac80211 hwsim, from Johannes Berg.
9) Handle 802.1ad properly in stmmac driver, from Elad Nachman.
10) Must grab HW caps before doing quirk checks in stmmac driver, from
Jose Abreu.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (81 commits)
net: stmmac: Run HWIF Quirks after getting HW caps
neighbour: skip NTF_EXT_LEARNED entries during forced gc
net: cxgb3: add error handling for sysfs_create_group
tls: fix waitall behavior in tls_sw_recvmsg
tls: fix use-after-free in tls_push_record
l2tp: filter out non-PPP sessions in pppol2tp_tunnel_ioctl()
l2tp: reject creation of non-PPP sessions on L2TPv2 tunnels
mlxsw: spectrum_switchdev: Fix port_vlan refcounting
mlxsw: spectrum_router: Align with new route replace logic
mlxsw: spectrum_router: Allow appending to dev-only routes
ipv6: Only emit append events for appended routes
stmmac: added support for 802.1ad vlan stripping
cfg80211: fix rcu in cfg80211_unregister_wdev
mac80211: Move up init of TXQs
mac80211_hwsim: fix module init error paths
cfg80211: initialize sinfo in cfg80211_get_station
nl80211: fix some kernel doc tag mistakes
hv_netvsc: Fix the variable sizes in ipsecv2 and rsc offload
rds: avoid unenecessary cong_update in loop transport
l2tp: clean up stale tunnel or session in pppol2tp_connect's error path
...
We need to release the refcnt on dst_entry in the route table, otherwise
we will leak the route.
Fixes: 8e6a9046b6 ("nfp: flower vxlan neighbour offload")
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Signed-off-by: Louis Peens <louis.peens@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
.ndo_get_phys_port_name was recently extended to support multi-vNIC
FWs. These are firmwares which can have more than one vNIC per PF
without associated port (e.g. Adaptive Buffer Management FW), therefore
we need a way of distinguishing the vNICs. Unfortunately, it's too
late to make flower use the same naming. Flower users may depend on
.ndo_get_phys_port_name returning -EOPNOTSUPP, for example the name
udev gave the PF vNIC was just the bare PCI device-based name before
the change, and will have 'nn0' appended after.
To ensure flower's vNIC doesn't have phys_port_name attribute, add
a flag to vNIC struct and set it in flower code. New projects will
not set the flag adhere to the naming scheme from the start.
Fixes: 51c1df83e3 ("nfp: assign vNIC id as phys_port_name of vNICs which are not ports")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are gathering software statistics on per-ring basis.
.ndo_get_stats64 handler adds the rings up. Unfortunately
we are currently only adding up active rings, which means
that if user decreases the number of active rings the
statistics from deactivated rings will no longer be counted
and total interface statistics may go backwards.
Always sum all possible rings, the stats are allocated
statically for max number of rings, so we don't have to
worry about them being removed. We could add the stats
up when user changes the ring count, but it seems unnecessary..
Adding up inactive rings will be very quick since no datapath
will be touching them.
Fixes: 164d1e9e5d ("nfp: add support for ethtool .set_channels")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Once upon a time nfp_cpp_resource_find() took a name parameter,
which could be any user-chosen string. Resources are identified
by a CRC32 hash of a 8 byte string, so we had to pad user input
with zeros to make sure CRC32 gave the correct result.
Since then nfp_cpp_resource_find() was made to operate on allocated
resources only (struct nfp_resource). We kzalloc those so there is
no need to pad the strings and use memcmp.
This avoids a GCC 8 stringop-truncation warning:
In function ‘nfp_cpp_resource_find’,
inlined from ‘nfp_resource_try_acquire’ at .../nfpcore/nfp_resource.c:153:8,
inlined from ‘nfp_resource_acquire’ at .../nfpcore/nfp_resource.c:206:9:
.../nfpcore/nfp_resource.c:108:2: warning: strncpy’ output may be truncated copying 8 bytes from a string of length 8 [-Wstringop-truncation]
strncpy(name_pad, res->name, sizeof(name_pad));
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add extack argument to reload, port_split and port_unsplit operations.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Report the stat diff to make sure MQ stats add up to child stats.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for MQ offload and setting RED parameters
on queue-by-queue basis.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allocate the PF representor as multi-queue to allow setting
the configuration per-queue.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a handful of statistics exposing some internal details
of the implementation. Expose those via ethtool.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow nfp apps to add extra ethtool stats.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Report basic and extended RED statistics back to TC.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Offload simple RED configurations. For now support only DCTCP
like scenarios where min and max are the same.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Queue levels for simple ECN marking are stored in _abi_nfd_out_q_lvls_X
symbol, where X is the PCIe PF id. Find out the location of that symbol
and add helpers for modifying it.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ABM NIC FW has a cut-through mode where the PCIe queuing
is bypassed, thus working like our standard NIC FWs. Use this
mode by default and only enable queuing in switchdev mode where
users can configure it.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some drivers are using a bare number inside phys_port_name
as VF id and OpenStack's regexps will pick it up. We can't
use a bare number for your vNICs, prefix the names with 'n'.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After recent change we started returning 0 from
ndo_get_phys_port_name for VFs. The name parameter for
ndo_get_phys_port_name is not initialized by the stack so
this can lead to a crash. We should have kept returning
-EOPNOTSUPP in the first place.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the egress device of an offloaded rule is a LAG port, then encode the
output port to the NFP with a LAG identifier and the offloaded group ID.
A prelag action is also offloaded which must be the first action of the
series (although may appear after other pre-actions - e.g. tunnels). This
causes the FW to check that it has the necessary information to output to
the requested LAG port. If it does not, the packet is sent to the kernel
before any other actions are applied to it.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds the control message handler to synchronize offloaded group config
with that of the kernel. Such messages are sent from fw to driver and
feature the following 3 flags:
- Data: an attached cmsg could not be processed - store for retransmission
- Xon: FW can accept new messages - retransmit any stored cmsgs
- Sync: full sync requested so retransmit all kernel LAG group info
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Monitor LAG events via the NETDEV_CHANGEUPPER/NETDEV_CHANGELOWERSTATE
notifiers to maintain a list of offloadable groups. Sync these groups with
HW via a delayed workqueue to prevent excessive re-configuration. When the
workqueue is triggered it may generate multiple control messages for
different groups. These messages are linked via a batch ID and flags to
indicate a new batch and the end of a batch.
Update private data in each repr to track their LAG lower state flags. The
state of a repr is used to determine the active netdevs that can be
offloaded. For example, in active-backup mode, we only offload the netdev
currently active.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a bitmap to each flower repr to track its state if it is enslaved by a
bond. This LAG state may be different to the port state - for example, the
port may be up but LAG state may be down due to the selection in an
active/backup bond.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Check if the fw contains the _abi_flower_balance_sync_enable symbol. If it
does then write a 1 to this indicating that the driver is willing to
receive NIC to kernel LAG related control messages.
If the write is successful, update the list of extra features supported by
the fw and add a stub to accept LAG cmsgs.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add an rtsym API function that combines the lookup of a symbol and the
writing of a value to it. Values can be written as unsigned 32 or 64 bits.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding a netdev to a bond requires that its mac address can be modified.
The default eth_mac_addr is sufficient to satisfy this requirement.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2018-05-24
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Björn Töpel cleans up AF_XDP (removes rebind, explicit cache alignment from uapi, etc).
2) David Ahern adds mtu checks to bpf_ipv{4,6}_fib_lookup() helpers.
3) Jesper Dangaard Brouer adds bulking support to ndo_xdp_xmit.
4) Jiong Wang adds support for indirect and arithmetic shifts to NFP
5) Martin KaFai Lau cleans up BTF uapi and makes the btf_header extensible.
6) Mathieu Xhonneux adds an End.BPF action to seg6local with BPF helpers allowing
to edit/grow/shrink a SRH and apply on a packet generic SRv6 actions.
7) Sandipan Das adds support for bpf2bpf function calls in ppc64 JIT.
8) Yonghong Song adds BPF_TASK_FD_QUERY command for introspection of tracing events.
9) other misc fixes from Gustavo A. R. Silva, Sirio Balmelli, John Fastabend, and Magnus Karlsson
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When NFP is modelled as a switch we assign phys_port_name to respective
port(representor )s:
vNIC0 - | - PF port (pf%d) MAC/PHY (p%d[s%d]) - |E==
In most cases there is only one vNIC for communication with the switch.
If there is more than one we need to be able to identify them. Use %d
as phys_port_name of the vNICs.
We don't have to pass ID to nfp_net_debugfs_vnic_add() separately any
more.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
PCI PFs can host more than one logical endpoint. In NFP terms
this means having more than one vNIC for PCIe PF. The vNICs
are usually corresponding 1:1 to Ethernet ports. In core NIC
we use the legacy idea of vNIC *being* the Ethernet port,
hence netdevs put pX(sY) in their phys_port_name, like Ethernet
ports would. When ASIC ports are fully represented we need to
be able to name different PCIe PF ports, too. Use a scheme
similar to Ethernet ports - pfXsY, for PCIe PF number X,
sub-port Y.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current control firmware does not cater too well to multi-host
applications. There is no way to check which hosts are up or
otherwise negotiate what the state of the external port (the
Ethernet port) should be. Make sure the link is up when driver
loads, and don't take it down when Ethernet port netdev is
closed.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To configure buffering points we need full set of netdevs:
ASIC
user netdev -- | -- PCIe port MAC port -- | --
Configuring egrees qdiscs on user netdev configures standard
Linux TC software qdiscs, configuring PCIe port qdiscs will
provide a way of setting ASIC queuing parameters for PCIe block.
MAC port netdev egress qdiscs correspond to ASIC MAC Traffic
Manager block.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Our previous apps all assumed to use only one eswitch mode (legacy
or switchdev) without the ability to change it. ABM NIC will
want to support the switch so plumb devlink_eswitch_mode_set through.
The devlink_eswitch_mode_set is expected to spawn representors and
potentially devlink ports so it's called under big devlink lock and
pf->lock.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nfp_apps can currently associate their structures with vNICs but
not representors. Add app priv pointer to representors as well.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ABM NIC requires more complex vNIC handling, allocate
per-vNIC structure. Find out RX queue base and PCI PF id.
There will be multiple PFs sharing the same MAC port, therefore
the MAC address assigned to the vNIC must be looked up in the
HWInfo database.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a very rudimentary active buffer management NIC support.
For now it's like a core NIC without SR-IOV support. Next
commits will extend its functionality.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current code doesn't enforce length requirements on 32bit accesses
with action NFP_CPP_ACTION_RW to memory units, but if the access
is only aligned to 4 bytes as well we will fall into the explicit
access case and error out. Such accesses are correct, allow them
by lowering the width earlier.
While at it use a switch statement to improve readability.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow app FW to advertise its shared buffer pool information.
Use the per-PF mailbox to configure them from devlink.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When working with devlink-related functionality for locking reasons
it's easier to create a new mailbox per-PCI PF device than try to
use one of the netdev/vNIC mailboxes.
Define new mailbox structure and resolve its symbol during probe.
For forward compatibility allow silent truncation of mailbox command
data.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nfp_net_pf_rtsym_read_optional() and nfp_net_pf_map_rtsym() are not
really related to networking code. Move them to the PF code and
remove the net from their names. They will soon be needed by code
outside of nfp_net_main.c anyway.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
S390 bpf_jit.S is removed in net-next and had changes in 'net',
since that code isn't used any more take the removal.
TLS data structures split the TX and RX components in 'net-next',
put the new struct members from the bug fix in 'net' into the RX
part.
The 'net-next' tree had some reworking of how the ERSPAN code works in
the GRE tunneling code, overlapping with a one-line headroom
calculation fix in 'net'.
Overlapping changes in __sock_map_ctx_update_elem(), keep the bits
that read the prog members via READ_ONCE() into local variables
before using them.
Signed-off-by: David S. Miller <davem@davemloft.net>
Devlink ports can have specific flavour according to the purpose of use.
This patch extend attrs_set so the driver can say which flavour port
has. Initial flavours are:
physical, cpu, dsa
User can query this to see right away what is the purpose of each port.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change existing setter for split port information into more generic
attrs setter. Alongside with that, allow to set port number and subport
number for split ports.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Code logic is similar with arithmetic right shift by constant, and NFP
get indirect shift amount through source A operand of PREV_ALU.
It is possible to fall back to logic right shift if the MSB is known to be
zero from range info, however there is no benefit to do this given logic
indirect right shift use the same number and cycle of instruction sequence.
Suppose the MSB of regX is the bit we want to replicate to fill in all the
vacant positions, and regY contains the shift amount, then we could use
single instruction to set up both.
[alu, --, regY, OR, regX]
--
NOTE: the PREV_ALU result doesn't need to write to any destination
register.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Code logic is similar with logic right shift except we also need to set
PREV_ALU result properly, the MSB of which is the bit that will be
replicated to fill in all the vacant positions.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
For indirect shifts, shift amount is not specified as constant, NFP needs
to get the shift amount through the low 5 bits of source A operand in
PREV_ALU, therefore extra instructions are needed compared with shifts by
constants.
Because NFP is 32-bit, so we are using register pair for 64-bit shifts and
therefore would need different instruction sequences depending on whether
shift amount is less than 32 or not.
NFP branch-on-bit-test instruction emitter is added by this patch and is
used for efficient runtime check on shift amount. We'd think the shift
amount is less than 32 if bit 5 is clear and greater or equal than 32
otherwise. Shift amount is greater than or equal to 64 will result in
undefined behavior.
This patch also use range info to avoid generating unnecessary runtime code
if we are certain shift amount is less than 32 or not.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Don't store repr pointer to reprs array until the representor is
successfully created. This avoids message about "representor
destruction" even when it was never created. Also it cleans-up the flow.
Also, check return value after port alloc.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-05-17
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Provide a new BPF helper for doing a FIB and neighbor lookup
in the kernel tables from an XDP or tc BPF program. The helper
provides a fast-path for forwarding packets. The API supports
IPv4, IPv6 and MPLS protocols, but currently IPv4 and IPv6 are
implemented in this initial work, from David (Ahern).
2) Just a tiny diff but huge feature enabled for nfp driver by
extending the BPF offload beyond a pure host processing offload.
Offloaded XDP programs are allowed to set the RX queue index and
thus opening the door for defining a fully programmable RSS/n-tuple
filter replacement. Once BPF decided on a queue already, the device
data-path will skip the conventional RSS processing completely,
from Jakub.
3) The original sockmap implementation was array based similar to
devmap. However unlike devmap where an ifindex has a 1:1 mapping
into the map there are use cases with sockets that need to be
referenced using longer keys. Hence, sockhash map is added reusing
as much of the sockmap code as possible, from John.
4) Introduce BTF ID. The ID is allocatd through an IDR similar as
with BPF maps and progs. It also makes BTF accessible to user
space via BPF_BTF_GET_FD_BY_ID and adds exposure of the BTF data
through BPF_OBJ_GET_INFO_BY_FD, from Martin.
5) Enable BPF stackmap with build_id also in NMI context. Due to the
up_read() of current->mm->mmap_sem build_id cannot be parsed.
This work defers the up_read() via a per-cpu irq_work so that
at least limited support can be enabled, from Song.
6) Various BPF JIT follow-up cleanups and fixups after the LD_ABS/LD_IND
JIT conversion as well as implementation of an optimized 32/64 bit
immediate load in the arm64 JIT that allows to reduce the number of
emitted instructions; in case of tested real-world programs they
were shrinking by three percent, from Daniel.
7) Add ifindex parameter to the libbpf loader in order to enable
BPF offload support. Right now only iproute2 can load offloaded
BPF and this will also enable libbpf for direct integration into
other applications, from David (Beckett).
8) Convert the plain text documentation under Documentation/bpf/ into
RST format since this is the appropriate standard the kernel is
moving to for all documentation. Also add an overview README.rst,
from Jesper.
9) Add __printf verification attribute to the bpf_verifier_vlog()
helper. Though it uses va_list we can still allow gcc to check
the format string, from Mathieu.
10) Fix a bash reference in the BPF selftest's Makefile. The '|& ...'
is a bash 4.0+ feature which is not guaranteed to be available
when calling out to shell, therefore use a more portable variant,
from Joe.
11) Fix a 64 bit division in xdp_umem_reg() by using div_u64()
instead of relying on the gcc built-in, from Björn.
12) Fix a sock hashmap kmalloc warning reported by syzbot when an
overly large key size is used in hashmap then causing overflows
in htab->elem_size. Reject bogus attr->key_size early in the
sock_hash_alloc(), from Yonghong.
13) Ensure in BPF selftests when urandom_read is being linked that
--build-id is always enabled so that test_stacktrace_build_id[_nmi]
won't be failing, from Alexei.
14) Add bitsperlong.h as well as errno.h uapi headers into the tools
header infrastructure which point to one of the arch specific
uapi headers. This was needed in order to fix a build error on
some systems for the BPF selftests, from Sirio.
15) Allow for short options to be used in the xdp_monitor BPF sample
code. And also a bpf.h tools uapi header sync in order to fix a
selftest build failure. Both from Prashant.
16) More formally clarify the meaning of ID in the direct packet access
section of the BPF documentation, from Wang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2018-05-14
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix nfp to allow zero-length BPF capabilities, meaning the nfp
capability parsing loop will otherwise exit early if the last
capability is zero length and therefore driver will fail to probe
with an error such as:
nfp: BPF capabilities left after parsing, parsed:92 total length:100
nfp: invalid BPF capabilities at offset:92
Fix from Jakub.
2) libbpf's bpf_object__open() may return IS_ERR_OR_NULL() and not
just an error. Fix libbpf's bpf_prog_load_xattr() to handle that
case as well, also from Jakub.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The bpf syscall and selftests conflicts were trivial
overlapping changes.
The r8169 change involved moving the added mdelay from 'net' into a
different function.
A TLS close bug fix overlapped with the splitting of the TLS state
into separate TX and RX parts. I just expanded the tests in the bug
fix from "ctx->conf == X" into "ctx->tx_conf == X && ctx->rx_conf
== X".
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 29a5dcae27 ("nfp: flower: offload phys port MTU change") we
take encapsulation headroom into account when calculating the max allowed
MTU. This is unnecessary as the max MTU advertised by firmware should have
already accounted for encap headroom.
Subtracting headroom twice brings the max MTU below what's necessary for
some deployments.
Fixes: 29a5dcae27 ("nfp: flower: offload phys port MTU change")
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some BPF capabilities carry no value, they simply indicate feature
is present. Our capability parsing loop will exit early if last
capability is zero-length because it's looking for more than 8 bytes
of data (8B is our TLV header length). Allow the last capability to
be zero-length.
This bug would lead to driver failing to probe with the following error
if the last capability FW advertises is zero-length:
nfp: BPF capabilities left after parsing, parsed:92 total length:100
nfp: invalid BPF capabilities at offset:92
Note the "parsed" and "length" values are 8 apart.
No shipping FW runs into this issue, but we can't guarantee that will
remain the case.
Fixes: 77a844ee65 ("nfp: bpf: prepare for parsing BPF FW capabilities")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
BPF has access to all internal FW datapath structures. Including
the structure containing RX queue selection. With little coordination
with the datapath we can let the offloaded BPF select the RX queue.
We just need a way to tell the datapath that queue selection has already
been done and it shouldn't overwrite it. Define a bit to tell datapath
BPF already selected a queue (QSEL_SET), if the selected queue is not
enabled (>= number of enabled queues) datapath will perform normal RSS.
BPF queue selection on the NIC can be used to replace standard
datapath RSS with fully programmable BPF/XDP RSS.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Minor conflict, a CHECK was placed into an if() statement
in net-next, whilst a newline was added to that CHECK
call in 'net'. Thanks to Daniel for the merge resolution.
Signed-off-by: David S. Miller <davem@davemloft.net>
Kernel will now replace map fds with actual pointer before
calling the offload prepare. We can identify those pointers
and replace them with NFP table IDs instead of loading the
table ID in code generated for CALL instruction.
This allows us to support having the same CALL being used with
different maps.
Since we don't want to change the FW ABI we still need to
move the TID from R1 to portion of R0 before the jump.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add support for the perf_event_output family of helpers.
The implementation on the NFP will not match the host code exactly.
The state of the host map and rings is unknown to the device, hence
device can't return errors when rings are not installed. The device
simply packs the data into a firmware notification message and sends
it over to the host, returning success to the program.
There is no notion of a host CPU on the device when packets are being
processed. Device will only offload programs which set BPF_F_CURRENT_CPU.
Still, if map index doesn't match CPU no error will be returned (see
above).
Dropped/lost firmware notification messages will not cause "lost
events" event on the perf ring, they are only visible via device
error counters.
Firmware notification messages may also get reordered in respect
to the packets which caused their generation.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
For asynchronous events originating from the device, like perf event
output, we need to be able to make sure that objects being referred
to by the FW message are valid on the host. FW events can get queued
and reordered. Even if we had a FW message "barrier" we should still
protect ourselves from bogus FW output.
Add a reverse-mapping hash table and record in it all raw map pointers
FW may refer to. Only record neutral maps, i.e. perf event arrays.
These are currently the only objects FW can refer to. Use RCU protection
on the read side, update side is under RTNL.
Since program vs map destruction order is slightly painful for offload
simply take an extra reference on all the recorded maps to make sure
they don't disappear.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Firmware requires that the ttl value for an encapsulating ipv4 tunnel
header be included as an action field. Prior to the support of Geneve
tunnel encap (when ttl set was removed completely), ttl value was
extracted from the tunnel key. However, tests have shown that this can
still produce a ttl of 0.
Fix the issue by setting the namespace default value for each new tunnel.
Follow up patch for net-next will do a full route lookup.
Fixes: 3ca3059dc3 ("nfp: flower: compile Geneve encap actions")
Fixes: b27d6a95a7 ("nfp: compile flower vxlan tunnel set actions")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For very very old generation of the management FW Ethernet port
information table may theoretically not be available. This in
turn will cause the nfp_port structures to not be allocated.
Make sure we don't crash the kernel when there is no eth_tbl:
RIP: 0010:nfp_net_pci_probe+0xf2/0xb40 [nfp]
...
Call Trace:
nfp_pci_probe+0x6de/0xab0 [nfp]
local_pci_probe+0x47/0xa0
work_for_cpu_fn+0x1a/0x30
process_one_work+0x1de/0x3e0
Found while working with broken/development version of management FW.
Fixes: a5950182c0 ("nfp: map mac_stats and vf_cfg BARs")
Fixes: 93da7d9660 ("nfp: provide nfp_port to of nfp_net_get_mac_addr()")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-04-27
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add extensive BPF helper description into include/uapi/linux/bpf.h
and a new script bpf_helpers_doc.py which allows for generating a
man page out of it. Thus, every helper in BPF now comes with proper
function signature, detailed description and return code explanation,
from Quentin.
2) Migrate the BPF collect metadata tunnel tests from BPF samples over
to the BPF selftests and further extend them with v6 vxlan, geneve
and ipip tests, simplify the ipip tests, improve documentation and
convert to bpf_ntoh*() / bpf_hton*() api, from William.
3) Currently, helpers that expect ARG_PTR_TO_MAP_{KEY,VALUE} can only
access stack and packet memory. Extend this to allow such helpers
to also use map values, which enabled use cases where value from
a first lookup can be directly used as a key for a second lookup,
from Paul.
4) Add a new helper bpf_skb_get_xfrm_state() for tc BPF programs in
order to retrieve XFRM state information containing SPI, peer
address and reqid values, from Eyal.
5) Various optimizations in nfp driver's BPF JIT in order to turn ADD
and SUB instructions with negative immediate into the opposite
operation with a positive immediate such that nfp can better fit
small immediates into instructions. Savings in instruction count
up to 4% have been observed, from Jakub.
6) Add the BPF prog's gpl_compatible flag to struct bpf_prog_info
and add support for dumping this through bpftool, from Jiri.
7) Move the BPF sockmap samples over into BPF selftests instead since
sockmap was rather a series of tests than sample anyway and this way
this can be run from automated bots, from John.
8) Follow-up fix for bpf_adjust_tail() helper in order to make it work
with generic XDP, from Nikita.
9) Some follow-up cleanups to BTF, namely, removing unused defines from
BTF uapi header and renaming 'name' struct btf_* members into name_off
to make it more clear they are offsets into string section, from Martin.
10) Remove test_sock_addr from TEST_GEN_PROGS in BPF selftests since
not run directly but invoked from test_sock_addr.sh, from Yonghong.
11) Remove redundant ret assignment in sample BPF loader, from Wang.
12) Add couple of missing files to BPF selftest's gitignore, from Anders.
There are two trivial merge conflicts while pulling:
1) Remove samples/sockmap/Makefile since all sockmap tests have been
moved to selftests.
2) Add both hunks from tools/testing/selftests/bpf/.gitignore to the
file since git should ignore all of them.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
If a flower rule has a repr both as ingress and egress port then 2
callbacks may be generated for the same rule request.
Add an indicator to each flow as to whether or not it was added from an
ingress registered cb. If so then ignore add/del/stat requests to it from
an egress cb.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When multiple netdevs are attached to a tc offload block and register for
callbacks, a rule added to the block will be propogated to all netdevs.
Previously these were detected as duplicates (based on cookie) and
rejected. Modify the rule nfp lookup function to optionally include an
ingress netdev and a host context along with the cookie value when
searching for a rule. When a new rule is passed to the driver, the netdev
the rule is to be attached to is considered when searching for dublicates.
When a stats update is received from HW, the host context is used
alongside the cookie to map to the correct host rule.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To aid debugging of performance issues caused by limited PCIe
bandwidth print the PCIe link information on probe.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NFP locks record the owner when held, for PCIe devices the owner
ID will be the PCIe link number. When driver loads it should scan
known locks and if they indicate that they are held by local
endpoint but the driver doesn't hold them - release them.
Locks can be left taken for instance when kernel gets kexec-ed or
after a crash. Management FW tries to clean up stale locks too,
but it currently depends on PCIe link going down which doesn't
always happen.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Comparison instruction requires a subtraction. If the constant
is negative we are more likely to fit it into a NFP instruction
directly if we change the sign and use addition.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
There are quite a few compare instructions now, use a table
to translate BPF instruction code to NFP instruction parameters
instead of parameterizing helpers. This saves LOC and makes
future extensions easier.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
NFP instruction set can fit small immediates into the instruction.
Negative integers, however, will never fit because they will have
highest bit set. If we swap the ALU op between ADD and SUB and
negate the constant we have a better chance of fitting small negative
integers into the instruction itself and saving one or two cycles.
immed[gprB_21, 0xfffffffc]
alu[gprA_4, gprA_4, +, gprB_21], gpr_wrboth
immed[gprB_21, 0xffffffff]
alu[gprA_5, gprA_5, +carry, gprB_21], gpr_wrboth
now becomes:
alu[gprA_4, gprA_4, -, 4], gpr_wrboth
alu[gprA_5, gprA_5, -carry, 0], gpr_wrboth
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
well (only "decrease" of pointer's location is going to be supported).
changing of this pointer will change packet's size.
for nfp driver we will just calculate packet's length unconditionally
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Introduce a second skb list for handling control messages and limit the
number of allowed messages. Some control messages are considered more
crucial than others, resulting in the need for a second skb list. By
splitting the list into a separate high and low priority list we can
ensure that messages on the high list get added to the head of the list
that gets processed, this however has no functional impact. Previously
there was no limit on the number of messages allowed on the queue, this
could result in the queue growing boundlessly and eventually the host
running out of memory.
Fixes: b985f870a5 ("nfp: process control messages in workqueue in flower app")
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously we processed the route ack control messages in the workqueue,
this unnecessarily loads the workqueue. We can deal with these messages
sooner as we know we are going to drop them.
Fixes: 8e6a9046b6 ("nfp: flower vxlan neighbour offload")
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When waiting for an NFP mutex is interrupted print a message
to make root causing later error messages easier.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently allow signals to interrupt the wait for management FW
commands. Exiting the wait should not cause trouble, the FW will
just finish executing the command in the background and new commands
will wait for the old one to finish.
However, this may not be what users expect (Ctrl-C not actually stopping
the command). Moreover some systems routinely request link information
with signals pending (Ubuntu 14.04 runs a landscape-sysinfo python tool
from MOTD) worrying users with errors like these:
nfp 0000:04:00.0: nfp_nsp: Error -512 waiting for code 0x0007 to start
nfp 0000:04:00.0: nfp: reading port table failed -512
Make the wait for management FW responses non-interruptible.
Fixes: 1a64821c6a ("nfp: add support for service processor access")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The NSP default buffer is a piece of NFP memory where additional
command data can be placed. Its format has been copied from
host buffer, but the PCIe selection bits do not make sense in
this case. If those get masked out from a NFP address - writes
to random place in the chip memory may be issued and crash the
device.
Even in the general NSP buffer case, it doesn't make sense to have the
PCIe selection bits there anymore. These are unused at the moment, and
when it becomes necessary, the PCIe selection bits should rather be
moved to another register to utilise more bits for the buffer address.
This has never been an issue because the buffer used to be
allocated in memory with less-than-38-bit-long address but that
is about to change.
Fixes: 1a64821c6a ("nfp: add support for service processor access")
Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We are currently counting packets with CHECKSUM_COMPLETE as
"hw_rx_csum_ok". This is confusing. Add a new counter.
To make sure it fits in the same cacheline move the less used
error counter to a different location.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Minor conflicts in drivers/net/ethernet/mellanox/mlx5/core/en_rep.c,
we had some overlapping changes:
1) In 'net' MLX5E_PARAMS_LOG_{SQ,RQ}_SIZE -->
MLX5E_REP_PARAMS_LOG_{SQ,RQ}_SIZE
2) In 'net-next' params->log_rq_size is renamed to be
params->log_rq_mtu_frames.
3) In 'net-next' params->hard_mtu is added.
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2018-03-31
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add raw BPF tracepoint API in order to have a BPF program type that
can access kernel internal arguments of the tracepoints in their
raw form similar to kprobes based BPF programs. This infrastructure
also adds a new BPF_RAW_TRACEPOINT_OPEN command to BPF syscall which
returns an anon-inode backed fd for the tracepoint object that allows
for automatic detach of the BPF program resp. unregistering of the
tracepoint probe on fd release, from Alexei.
2) Add new BPF cgroup hooks at bind() and connect() entry in order to
allow BPF programs to reject, inspect or modify user space passed
struct sockaddr, and as well a hook at post bind time once the port
has been allocated. They are used in FB's container management engine
for implementing policy, replacing fragile LD_PRELOAD wrapper
intercepting bind() and connect() calls that only works in limited
scenarios like glibc based apps but not for other runtimes in
containerized applications, from Andrey.
3) BPF_F_INGRESS flag support has been added to sockmap programs for
their redirect helper call bringing it in line with cls_bpf based
programs. Support is added for both variants of sockmap programs,
meaning for tx ULP hooks as well as recv skb hooks, from John.
4) Various improvements on BPF side for the nfp driver, besides others
this work adds BPF map update and delete helper call support from
the datapath, JITing of 32 and 64 bit XADD instructions as well as
offload support of bpf_get_prandom_u32() call. Initial implementation
of nfp packet cache has been tackled that optimizes memory access
(see merge commit for further details), from Jakub and Jiong.
5) Removal of struct bpf_verifier_env argument from the print_bpf_insn()
API has been done in order to prepare to use print_bpf_insn() soon
out of perf tool directly. This makes the print_bpf_insn() API more
generic and pushes the env into private data. bpftool is adjusted
as well with the print_bpf_insn() argument removal, from Jiri.
6) Couple of cleanups and prep work for the upcoming BTF (BPF Type
Format). The latter will reuse the current BPF verifier log as
well, thus bpf_verifier_log() is further generalized, from Martin.
7) For bpf_getsockopt() and bpf_setsockopt() helpers, IPv4 IP_TOS read
and write support has been added in similar fashion to existing
IPv6 IPV6_TCLASS socket option we already have, from Nikita.
8) Fixes in recent sockmap scatterlist API usage, which did not use
sg_init_table() for initialization thus triggering a BUG_ON() in
scatterlist API when CONFIG_DEBUG_SG was enabled. This adds and
uses a small helper sg_init_marker() to properly handle the affected
cases, from Prashant.
9) Let the BPF core follow IDR code convention and therefore use the
idr_preload() and idr_preload_end() helpers, which would also help
idr_alloc_cyclic() under GFP_ATOMIC to better succeed under memory
pressure, from Shaohua.
10) Last but not least, a spelling fix in an error message for the
BPF cookie UID helper under BPF sample code, from Colin.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Trigger a port mod message to request an MTU change on the NIC when any
physical port representor is assigned a new MTU value. The driver waits
10 msec for an ack that the FW has set the MTU. If no ack is received the
request is rejected and an appropriate warning flagged.
Rather than maintain an MTU queue per repr, one is maintained per app.
Because the MTU ndo is protected by the rtnl lock, there can never be
contention here. Portmod messages from the NIC are also protected by
rtnl so we first check if the portmod is an ack and, if so, handle outside
rtnl and the cmsg work queue.
Acks are detected by the marking of a bit in a portmod response. They are
then verfied by checking the port number and MTU value expected by the
app. If the expected MTU is 0 then no acks are currently expected.
Also, ensure that the packet headroom reserved by the flower firmware is
considered when accepting an MTU change on any repr.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename the 'change_mtu' app callback to 'check_mtu'. This is called
whenever an MTU change is requested on a netdev. It can reject the
change but is not responsible for implementing it.
Introduce a new 'repr_change_mtu' app callback that is hit when the MTU
of a repr is to be changed. This is responsible for performing the MTU
change and verifying it.
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When FW responds with a message of wrong size or type make sure
the type is checked first and included in the wrong size message.
This makes it easier to figure out which FW command failed.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
NFP has a prng register, which we can read to obtain a u32 worth
of pseudo random data. Generate code for it.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Allow atomic add to be used even when the value is not guaranteed
to fit into a 16 bit immediate. This requires the value to be pulled
as data, and therefore use of a transfer register and a context swap.
Track the information about possible lengths of the value, if it's
guaranteed to be larger than 16bits don't generate the code for the
optimized case at all.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Allow callers to control the delay slots of commands, instead of
giving them just a wait/nowait choice.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Implement atomic add operation for 32 and 64 bit values. Depend
on the verifier to ensure alignment. Values have to be kept in
big endian and swapped upon read/write. For now only support
atomic add of a constant.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Support calling map_delete_elem() FW helper from the datapath
programs. For JIT checks and code are basically equivalent
to map lookups. Similarly to other map helper key must be on
the stack. Different pointer types are left for future extension.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Support calling map_update_elem() from the datapath programs
by calling into FW-provided helper. Value pointer is passed
in LM pointer #2. Keeping track of old state for arg3 is not
necessary, since LM pointer #2 will be always loaded in this
case, the trivial optimization for value at the bottom of the
stack can't be done here.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a verifier helper for performing the basic state checks
before a call to a map helper.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Our implementation has restriction on stack pointers for function
calls. Move the common checks into a helper for reuse. The state
has to be encapsulated into a structure to support parameters
other than BPF_REG_2.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
We will reuse most of map call code gen for other map calls.
Rename the lookup gen function and use meta->func_id instead
of hard-coding lookup.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch is the front end of this optimisation, it detects and marks
those packet reads that could be cached. Then the optimisation "backend"
will be activated automatically.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch add the support for unaligned read offset, i.e. the read offset
to the start of packet cache area is not aligned to REG_WIDTH. In this
case, the read area might across maximum three transfer-in registers.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch assumes there is a packet data cache, and would try to read
packet data from the cache instead of from memory.
This patch only implements the optimisation "backend", it doesn't build
the packet data cache, so this optimisation is not enabled.
This patch has only enabled aligned packet data read, i.e. when the read
offset to the start of cache is REG_WIDTH aligned.
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Implement ip fragmentation match offloading for both IPv4 and IPv6. Allows
offloading frag, nofrag, first and nofirstfrag classification.
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactored shared ip header code for IPv4 and IPv6 in match offload.
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prefer the direct use of octal for permissions.
Done with checkpatch -f --types=SYMBOLIC_PERMS --fix-inplace
and some typing.
Miscellanea:
o Whitespace neatening around these conversions.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NFP program allocation length is in bytes and NFP program length
is in instructions, fix the comparison of the two.
Fixes: 9314c442d7 ("nfp: bpf: move translation prepare to offload.c")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>