Commit Graph

739359 Commits

Author SHA1 Message Date
John Fastabend 0dcbbf6785 bpf: sockmap sample test for bpf_msg_pull_data
This adds an option to test the msg_pull_data helper. This
uses two options txmsg_start and txmsg_end to let the user
specify start and end bytes to pull.

The options can be used with txmsg_apply, txmsg_cork options
as well as with any of the basic tests, txmsg, txmsg_redir and
txmsg_drop (plus noisy variants) to run pull_data inline with
those tests. By giving user direct control over the variables
we can easily do negative testing as well as positive tests.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-19 21:14:41 +01:00
John Fastabend e6373ce70a bpf: sockmap add SK_DROP tests
Add tests for SK_DROP.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-19 21:14:41 +01:00
John Fastabend 468b3fdea8 bpf: sockmap sample support for bpf_msg_cork_bytes()
Add sample application support for the bpf_msg_cork_bytes helper. This
lets the user specify how many bytes each verdict should apply to.

Similar to apply_bytes() tests these can be run as a stand-alone test
when used without other options or inline with other tests by using
the txmsg_cork option along with any of the basic tests txmsg,
txmsg_redir, txmsg_drop.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-19 21:14:40 +01:00
John Fastabend 1c16c3126a bpf: sockmap, add sample option to test apply_bytes helper
This adds an option to test the apply_bytes helper. This option lets
the user specify an int on the command line specifying how much data
each verdict should apply to.

When this is set a map entry is set with the bytes input by the user
and then the specified program --txmsg or --txmsg_redir will use the
value and set the applied data. If no other option is set then a
default --txmsg_apply program is run. This program will drop pkts
if an error is detected on the bytes map lookup. Useful to verify
the map lookup and apply helper are working and causing a hard
error if it is not.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-19 21:14:40 +01:00
John Fastabend 6bce9d2ca6 bpf: sockmap sample, add data verification option
To verify data is not being dropped or corrupted this adds an option
to verify test-patterns on recv.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-19 21:14:40 +01:00
John Fastabend e67463cb5d bpf: sockmap sample, add sendfile test
To exercise TX ULP sendpage implementation we need a test that does
a sendfile. Add sendfile test option here.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-19 21:14:40 +01:00
John Fastabend 4c4c3c276c bpf: sockmap sample, add option to attach SK_MSG program
Add sockmap option to use SK_MSG program types.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-19 21:14:40 +01:00
John Fastabend 1acc60b6a4 bpf: add verifier tests for BPF_PROG_TYPE_SK_MSG
Test read and writes for BPF_PROG_TYPE_SK_MSG.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-19 21:14:39 +01:00
John Fastabend 82a8616889 bpf: add map tests for BPF_PROG_TYPE_SK_MSG
Add map tests to attach BPF_PROG_TYPE_SK_MSG types to a sockmap.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-19 21:14:39 +01:00
John Fastabend 015632bb30 bpf: sk_msg program helper bpf_sk_msg_pull_data
Currently, if a bpf sk msg program is run the program
can only parse data that the (start,end) pointers already
consumed. For sendmsg hooks this is likely the first
scatterlist element. For sendpage this will be the range
(0,0) because the data is shared with userspace and by
default we want to avoid allowing userspace to modify
data while (or after) BPF verdict is being decided.

To support pulling in additional bytes for parsing use
a new helper bpf_sk_msg_pull(start, end, flags) which
works similar to cls tc logic. This helper will attempt
to point the data start pointer at 'start' bytes offest
into msg and data end pointer at 'end' bytes offset into
message.

After basic sanity checks to ensure 'start' <= 'end' and
'end' <= msg_length there are a few cases we need to
handle.

First the sendmsg hook has already copied the data from
userspace and has exclusive access to it. Therefor, it
is not necessesary to copy the data. However, it may
be required. After finding the scatterlist element with
'start' offset byte in it there are two cases. One the
range (start,end) is entirely contained in the sg element
and is already linear. All that is needed is to update the
data pointers, no allocate/copy is needed. The other case
is (start, end) crosses sg element boundaries. In this
case we allocate a block of size 'end - start' and copy
the data to linearize it.

Next sendpage hook has not copied any data in initial
state so that data pointers are (0,0). In this case we
handle it similar to the above sendmsg case except the
allocation/copy must always happen. Then when sending
the data we have possibly three memory regions that
need to be sent, (0, start - 1), (start, end), and
(end + 1, msg_length). This is required to ensure any
writes by the BPF program are correctly transmitted.

Lastly this operation will invalidate any previous
data checks so BPF programs will have to revalidate
pointers after making this BPF call.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-19 21:14:39 +01:00
John Fastabend 91843d540a bpf: sockmap, add msg_cork_bytes() helper
In the case where we need a specific number of bytes before a
verdict can be assigned, even if the data spans multiple sendmsg
or sendfile calls. The BPF program may use msg_cork_bytes().

The extreme case is a user can call sendmsg repeatedly with
1-byte msg segments. Obviously, this is bad for performance but
is still valid. If the BPF program needs N bytes to validate
a header it can use msg_cork_bytes to specify N bytes and the
BPF program will not be called again until N bytes have been
accumulated. The infrastructure will attempt to coalesce data
if possible so in many cases (most my use cases at least) the
data will be in a single scatterlist element with data pointers
pointing to start/end of the element. However, this is dependent
on available memory so is not guaranteed. So BPF programs must
validate data pointer ranges, but this is the case anyways to
convince the verifier the accesses are valid.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-19 21:14:39 +01:00
John Fastabend 2a100317c9 bpf: sockmap, add bpf_msg_apply_bytes() helper
A single sendmsg or sendfile system call can contain multiple logical
messages that a BPF program may want to read and apply a verdict. But,
without an apply_bytes helper any verdict on the data applies to all
bytes in the sendmsg/sendfile. Alternatively, a BPF program may only
care to read the first N bytes of a msg. If the payload is large say
MB or even GB setting up and calling the BPF program repeatedly for
all bytes, even though the verdict is already known, creates
unnecessary overhead.

To allow BPF programs to control how many bytes a given verdict
applies to we implement a bpf_msg_apply_bytes() helper. When called
from within a BPF program this sets a counter, internal to the
BPF infrastructure, that applies the last verdict to the next N
bytes. If the N is smaller than the current data being processed
from a sendmsg/sendfile call, the first N bytes will be sent and
the BPF program will be re-run with start_data pointing to the N+1
byte. If N is larger than the current data being processed the
BPF verdict will be applied to multiple sendmsg/sendfile calls
until N bytes are consumed.

Note1 if a socket closes with apply_bytes counter non-zero this
is not a problem because data is not being buffered for N bytes
and is sent as its received.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-19 21:14:39 +01:00
John Fastabend 4f738adba3 bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data
This implements a BPF ULP layer to allow policy enforcement and
monitoring at the socket layer. In order to support this a new
program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
the sendmsg/sendpage hook. To attach the policy to sockets a
sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.

Similar to previous sockmap usages when a sock is added to a
sockmap, via a map update, if the map contains a BPF_SK_MSG_VERDICT
program type attached then the BPF ULP layer is created on the
socket and the attached BPF_PROG_TYPE_SK_MSG program is run for
every msg in sendmsg case and page/offset in sendpage case.

BPF_PROG_TYPE_SK_MSG Semantics/API:

BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and
SK_DROP. Returning SK_DROP free's the copied data in the sendmsg
case and in the sendpage case leaves the data untouched. Both cases
return -EACESS to the user. Returning SK_PASS will allow the msg to
be sent.

In the sendmsg case data is copied into kernel space buffers before
running the BPF program. The kernel space buffers are stored in a
scatterlist object where each element is a kernel memory buffer.
Some effort is made to coalesce data from the sendmsg call here.
For example a sendmsg call with many one byte iov entries will
likely be pushed into a single entry. The BPF program is run with
data pointers (start/end) pointing to the first sg element.

In the sendpage case data is not copied. We opt not to copy the
data by default here, because the BPF infrastructure does not
know what bytes will be needed nor when they will be needed. So
copying all bytes may be wasteful. Because of this the initial
start/end data pointers are (0,0). Meaning no data can be read or
written. This avoids reading data that may be modified by the
user. A new helper is added later in this series if reading and
writing the data is needed. The helper call will do a copy by
default so that the page is exclusively owned by the BPF call.

The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg
in the sendmsg() case and the entire page/offset in the sendpage case.
This avoids ambiguity on how to handle mixed return codes in the
sendmsg case. Again a helper is added later in the series if
a verdict needs to apply to multiple system calls and/or only
a subpart of the currently being processed message.

The helper msg_redirect_map() can be used to select the socket to
send the data on. This is used similar to existing redirect use
cases. This allows policy to redirect msgs.

Pseudo code simple example:

The basic logic to attach a program to a socket is as follows,

  // load the programs
  bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
		&obj, &msg_prog);

  // lookup the sockmap
  bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");

  // get fd for sockmap
  map_fd_msg = bpf_map__fd(bpf_map_msg);

  // attach program to sockmap
  bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);

Adding sockets to the map is done in the normal way,

  // Add a socket 'fd' to sockmap at location 'i'
  bpf_map_update_elem(map_fd_msg, &i, fd, BPF_ANY);

After the above any socket attached to "my_sock_map", in this case
'fd', will run the BPF msg verdict program (msg_prog) on every
sendmsg and sendpage system call.

For a complete example see BPF selftests or sockmap samples.

Implementation notes:

It seemed the simplest, to me at least, to use a refcnt to ensure
psock is not lost across the sendmsg copy into the sg, the bpf program
running on the data in sg_data, and the final pass to the TCP stack.
Some performance testing may show a better method to do this and avoid
the refcnt cost, but for now use the simpler method.

Another item that will come after basic support is in place is
supporting MSG_MORE flag. At the moment we call sendpages even if
the MSG_MORE flag is set. An enhancement would be to collect the
pages into a larger scatterlist and pass down the stack. Notice that
bpf_tcp_sendmsg() could support this with some additional state saved
across sendmsg calls. I built the code to support this without having
to do refactoring work. Other features TBD include ZEROCOPY and the
TCP_RECV_QUEUE/TCP_NO_QUEUE support. This will follow initial series
shortly.

Future work could improve size limits on the scatterlist rings used
here. Currently, we use MAX_SKB_FRAGS simply because this was being
used already in the TLS case. Future work could extend the kernel sk
APIs to tune this depending on workload. This is a trade-off
between memory usage and throughput performance.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-19 21:14:38 +01:00
John Fastabend 8c05dbf04b net: generalize sk_alloc_sg to work with scatterlist rings
The current implementation of sk_alloc_sg expects scatterlist to always
start at entry 0 and complete at entry MAX_SKB_FRAGS.

Future patches will want to support starting at arbitrary offset into
scatterlist so add an additional sg_start parameters and then default
to the current values in TLS code paths.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-19 21:14:38 +01:00
John Fastabend 312fc2b4c8 net: do_tcp_sendpages flag to avoid SKBTX_SHARED_FRAG
When calling do_tcp_sendpages() from in kernel and we know the data
has no references from user side we can omit SKBTX_SHARED_FRAG flag.
This patch adds an internal flag, NO_SKBTX_SHARED_FRAG that can be used
to omit setting SKBTX_SHARED_FRAG.

The flag is not exposed to userspace because the sendpage call from
the splice logic masks out all bits except MSG_MORE.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-19 21:14:38 +01:00
John Fastabend ffa3566001 sockmap: convert refcnt to an atomic refcnt
The sockmap refcnt up until now has been wrapped in the
sk_callback_lock(). So its not actually needed any locking of its
own. The counter itself tracks the lifetime of the psock object.
Sockets in a sockmap have a lifetime that is independent of the
map they are part of. This is possible because a single socket may
be in multiple maps. When this happens we can only release the
psock data associated with the socket when the refcnt reaches
zero. There are three possible delete sock reference decrement
paths first through the normal sockmap process, the user deletes
the socket from the map. Second the map is removed and all sockets
in the map are removed, delete path is similar to case 1. The third
case is an asyncronous socket event such as a closing the socket. The
last case handles removing sockets that are no longer available.
For completeness, although inc does not pose any problems in this
patch series, the inc case only happens when a psock is added to a
map.

Next we plan to add another socket prog type to handle policy and
monitoring on the TX path. When we do this however we will need to
keep a reference count open across the sendmsg/sendpage call and
holding the sk_callback_lock() here (on every send) seems less than
ideal, also it may sleep in cases where we hit memory pressure.
Instead of dealing with these issues in some clever way simply make
the reference counting a refcnt_t type and do proper atomic ops.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-19 21:14:38 +01:00
John Fastabend 2c3682f0be sock: make static tls function alloc_sg generic sock helper
The TLS ULP module builds scatterlists from a sock using
page_frag_refill(). This is going to be useful for other ULPs
so move it into sock file for more general use.

In the process remove useless goto at end of while loop.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-19 21:14:38 +01:00
David S. Miller c314c7ba40 Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:

====================
40GbE Intel Wired LAN Driver Updates 2018-03-19

This series contains updates to i40e and i40evf only.

Alex fixes a potential deadlock in the configure_clsflower function in
i40evf, where we exit with the "IN_CRITICAL_TASK" bit set while
notifying the PF of flower filters.

Jan fixed an issue where it was possible to set a mode that is not
allowed which resulted in link being down, so fixed the parity between
i40e_set_link_ksettings() and i40e_get_link_ksettings().

Patryk fixes a bug where a backplane device was allowing the setting of
link settings, which is not allowed.

Shiraz fixes a crash when entering S3 because the client interface was
freeing the MSIx vectors while they are still in use.

Jake fixes up a function header comment to document a newly added
parameter.  Also cleaned up flags that were never used.

Doug fixes the incorrect return type for i40e_aq_add_cloud_filters().
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-19 15:12:03 -04:00
Paweł Jabłoński 12d80bca0b i40e: Fix the polling mechanism of GLGEN_RSTAT.DEVSTATE
This fixes the polling mechanism of GLGEN_RSTAT.DEVSTATE in the
PF Reset path when Global Reset is in progress. While the driver
is polling for the end of the PF Reset and the Global Reset is
triggered, abandon the PF Reset path and prepare for the
upcoming Global Reset.

Signed-off-by: Paweł Jabłoński <pawel.jablonski@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-03-19 09:57:52 -07:00
Jacob Keller 6b9a9c26ef i40evf: remove flags that are never used
These flags were defined, but there is no use within the driver code, so
we don't need to keep them.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-03-19 09:56:27 -07:00
Patryk Małek 3f8c843727 i40e: Prevent setting link speed on I40E_DEV_ID_25G_B
Setting link settings on backplane devices shouldn't be allowed.
This patch adds one more device id to the list which we check
that against.

Signed-off-by: Patryk Małek <patryk.malek@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-03-19 09:55:00 -07:00
Doug Dziggel 85925cd0b8 i40e: Fix incorrect return types
Fix return types from i40e_status to enum i40e_status_code.

Signed-off-by: Doug Dziggel <douglas.a.dziggel@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-03-19 09:52:32 -07:00
Jacob Keller 35bea90436 i40e: add doxygen comment for new mode parameter
A recent patch updated the signature for i40e_aq_set_switch_config() to
add a new parameter 'mode'. It forgot to document the parameter in the
doxygen function header comment. Add the parameter to the function
description now.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-03-19 09:50:51 -07:00
Shiraz Saleem ddbb8d5dd9 i40e: Close client on suspend and restore client MSIx on resume
During suspend client MSIx vectors are freed while they are still
in use causing a crash on entering S3.

Fix this calling client close before freeing up its MSIx vectors.
Also update the client MSIx vectors on resume before client
open is called.

Fixes commit b980c0634f ("i40e: shutdown all IRQs and disable MSI-X
when suspended")

Reported-by: Stefan Assmann <sassmann@redhat.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-03-19 09:46:09 -07:00
Patryk Małek 88244a48d2 i40e: Prevent setting link speed on KX_X722
Setting link settings on backplane devices shouldn't be allowed.
This patch adds one more device id to the list which we check
that against.

Signed-off-by: Patryk Małek <patryk.malek@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-03-19 09:44:23 -07:00
Jan Sokolowski 32c23b47db i40e: Properly check allowed advertisement capabilities
The i40e_set_link_ksettings and i40e_get_link_ksettings use different
codepaths to check available and supported advertisement modes. This
creates scenarios where it's possible to set a mode that's not allowed,
resulting in a link down.

Fix setting advertisement in i40e_set_link_ksettings by calling
i40e_get_link_ksettings to check what modes are allowed.

Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-03-19 09:42:20 -07:00
Alexander Duyck 640a8af584 i40evf: Reorder configure_clsflower to avoid deadlock on error
While doing some code review I noticed that we can get into a state where
we exit with the "IN_CRITICAL_TASK" bit set while notifying the PF of
flower filters. This patch is meant to address that plus tweak the ordering
of the while loop waiting on it slightly so that we don't wait an extra
period after we have failed for the last time.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2018-03-19 09:28:03 -07:00
Stefano Brivio e3c72f3d37 selftests: pmtu: Drop prints to kernel log from pmtu_vti6_link_change_mtu
Reported-by: David Ahern <dsahern@gmail.com>
Fixes: 1fad59ea1c ("selftests: pmtu: Add pmtu_vti6_link_change_mtu test")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-18 17:13:29 -04:00
David S. Miller ef57e31e2b Merge branch 'mv88e6xxx-auto-phy-intr'
Andrew Lunn says:

====================
Automatic PHY interrupts

Now that the mv88e6xxx driver either installs in interrupt handler, or
polls for interrupts, it is possible to always handle PHY interrupts,
rather than have phylib perform the polling. This speeds up detection
of link changes and reduces the load on the MDIO bus, which is
beneficial for PTP.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-18 16:52:59 -04:00
Andrew Lunn 6f88284f3b net: dsa: mv88e6xxx: Add MDIO interrupts for internal PHYs
When registering an MDIO bus, it is possible to pass an array of
interrupts, one per address on the bus. phylib will then associate the
interrupt to the PHY device, if no other interrupt is provided.

Some of the global2 interrupts are PHY interrupts. Place them into the
MDIO bus structure.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-18 16:52:59 -04:00
Andrew Lunn bc3931557d net: dsa: mv88e6xxx: Add number of internal PHYs
Add to the info structure the number of internal PHYs, if they generate
interrupts. Some of the older generations of switches have internal
PHYs, but no interrupt registers. In this case, set the count to zero.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-18 16:52:58 -04:00
Andrew Lunn adfccf1182 net: dsa: mv88e6xxx: Add missing g1 IRQ numbers
With the recent change to polling for interrupts, it is important that
the number of global 1 interrupts is listed. Without it, the driver
requests an interrupt domain for zero interrupts, which returns
EINVAL, and the probe fails.

Add two missing entries.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-18 16:52:58 -04:00
Florian Fainelli ef44d78d89 net: dsa: mv88e6xxx: Fix missing register lock in serdes_get_stats
We can hit the register lock not held assertion with the following path:

[   34.170631] mv88e6085 0.1:00: Switch registers lock not held!
[   34.176510] CPU: 0 PID: 950 Comm: ethtool Not tainted 4.16.0-rc4 #143
[   34.182985] Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree)
[   34.189519] Backtrace:
[   34.192033] [<8010c4b4>] (dump_backtrace) from [<8010c788>] (show_stack+0x20/0x24)
[   34.199680]  r6:9f5dc010 r5:00000011 r4:9f5dc010 r3:00000000
[   34.205434] [<8010c768>] (show_stack) from [<80679d38>] (dump_stack+0x24/0x28)
[   34.212719] [<80679d14>] (dump_stack) from [<804844a8>] (mv88e6xxx_read+0x70/0x7c)
[   34.220376] [<80484438>] (mv88e6xxx_read) from [<804870dc>] (mv88e6xxx_port_get_cmode+0x34/0x4c)
[   34.229257]  r5:a09cd128 r4:9ee31d07
[   34.232880] [<804870a8>] (mv88e6xxx_port_get_cmode) from [<80487e6c>] (mv88e6352_port_has_serdes+0x24/0x64)
[   34.242690]  r4:9f5dc010
[   34.245309] [<80487e48>] (mv88e6352_port_has_serdes) from [<804880b8>] (mv88e6352_serdes_get_stats+0x28/0x12c)
[   34.255389]  r4:00000001
[   34.257973] [<80488090>] (mv88e6352_serdes_get_stats) from [<804811e8>] (mv88e6xxx_get_ethtool_stats+0xb0/0xc0)
[   34.268156]  r10:00000000 r9:00000000 r8:00000000 r7:a09cd020 r6:00000001 r5:9f5dc01c
[   34.276052]  r4:9f5dc010
[   34.278631] [<80481138>] (mv88e6xxx_get_ethtool_stats) from [<8064f740>] (dsa_slave_get_ethtool_stats+0xbc/0xc4)

mv88e6xxx_get_ethtool_stats() calls mv88e6xxx_get_stats() which calls both
chip->info->ops->stats_get_stats(), which holds the register lock, and
chip->info->ops->serdes_get_stats() which does not. Have
chip->info->ops->serdes_get_stats() be running with the register lock held to
avoid such assertions.

Fixes: 436fe17d27 ("net: dsa: mv88e6xxx: Allow the SERDES interfaces to have statistics")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-18 16:35:44 -04:00
Andrew Lunn 76f38f1f3c net: dsa: mv88e6xxx: Fix IRQ when loading module
Handle polled interrupts correctly when loading the module.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Fixes: 294d711ee8 ("net: dsa: mv88e6xxx: Poll when no interrupt defined")
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 20:24:48 -04:00
David S. Miller 3661daa66f Merge branch 'selftests-pmtu-Add-further-vti-vti6-MTU-and-PMTU-tests'
Stefano Brivio says:

====================
selftests: pmtu: Add further vti/vti6 MTU and PMTU tests

Patches 5/10 to 10/10 add tests to verify default MTU assignment
for vti4 and vti6 interfaces, to check that MTU values set on new
link and link changes are properly taken and validated, and to
verify PMTU exceptions on vti4 interfaces.

Patch 1/10 reverses function return codes as suggested by David
Ahern.

Patch 2/10 fixes the helper to fetch exceptions MTU to run in the
passed namespace.

Patches 3/10 and 4/10 are preparation work to make it easier to
introduce those tests.

v2: Reverse return codes, and make output prettier in 4/9 by
    using padded printf, test descriptions and buffered error
    strings. Remove accidental output to /dev/kmsg from 10/10
    (was 9/9).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 20:15:14 -04:00
Stefano Brivio 1fad59ea1c selftests: pmtu: Add pmtu_vti6_link_change_mtu test
This test checks that MTU configured from userspace is used on
link creation and changes, and that when it's not passed from
userspace, it's calculated properly from the MTU of the lower
layer.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 20:15:14 -04:00
Stefano Brivio 8b6022fc78 selftests: pmtu: Add pmtu_vti6_link_add_mtu test
Same as pmtu_vti4_link_add_mtu test, but for IPv6.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 20:15:14 -04:00
Stefano Brivio 719e121574 selftests: pmtu: Add pmtu_vti4_link_add_mtu test
This test checks that MTU given on vti link creation is actually
configured, and that tunnel is not created with an invalid MTU
value.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 20:15:14 -04:00
Stefano Brivio 5e84430bb8 selftests: pmtu: Add test_pmtu_vti4_exception test
This test checks that PMTU exceptions are created only when
needed on IPv4 routes with vti and xfrm, and their PMTU value is
checked as well.

We can't adopt the same approach as test_pmtu_vti6_exception()
here, because on IPv4 administrative MTU changes won't be
reflected directly on PMTU.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 20:15:14 -04:00
Stefano Brivio 35b49424b8 selftests: pmtu: Add pmtu_vti6_default_mtu test
Same as pmtu_vti4_default_mtu, but on IPv6 with vti6.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 20:15:14 -04:00
Stefano Brivio a41c789bdc selftests: pmtu: Add pmtu_vti4_default_mtu test
This test checks that the MTU assigned by default to a vti (IPv4)
interface created on top of veth is simply veth's MTU minus the
length of the encapsulated IPv4 header.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 20:15:14 -04:00
Stefano Brivio 36455bd1e9 selftests: pmtu: Introduce support for multiple tests
Introduce list of tests and their descriptions, and loop on it
in main body.

Tests will now just take care of calling setup with a list of
"units" they need, and return 0 on success, 1 on failure, 2 if
the test had to be skipped.

Main script body will take care of displaying results and
cleaning up after every test. Introduce guard variable so that
we don't clean up twice in case of interrupts or unexpected
failures.

The pmtu_vti6_exception test can now run its third step even if
the previous one failed, as we can return values from it.

Also introduce support to display test descriptions, and display
aligned OK/FAIL/SKIP test outcomes. Buffer error strings so that
in case of failure we can display them right under the outcome
for each test.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 20:15:14 -04:00
Stefano Brivio f2c929feec selftests: pmtu: Factor out MTU parsing helper
...so that it can be used for any iproute command output.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 20:15:14 -04:00
Stefano Brivio 822d2f86c4 selftests: pmtu: Use namespace command prefix to fetch route mtu
In 7af137b72131 ("selftests: net: Introduce first PMTU test") I
accidentally assumed route_get_* helpers would run from a single
namespace. Make them a bit more generic, by passing the
namespace command prefix as a parameter instead.

Fixes: 7af137b72131 ("selftests: net: Introduce first PMTU test")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 20:15:13 -04:00
Stefano Brivio 380e29a6b7 selftests: pmtu: Reverse return codes of functions
David suggests it's more intuitive to return non-zero on
failures, and zero on success.

No need to introduce tail 'return 0' in functions, they will
return the exit code of the last command anyway.

Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 20:15:13 -04:00
David S. Miller fe2d55d295 Merge branch 'ibmvnic-Update-TX-pool-and-TX-routines'
Thomas Falcon says:

====================
ibmvnic: Update TX pool and TX routines

This patch restructures the TX pool data structure and provides a
separate TX pool array for TSO transmissions. This is already used
in some way due to our unique DMA situation, namely that we cannot
use single DMA mappings for packet data. Previously, both buffer
arrays used the same pool entry. This restructuring allows for
some additional cleanup in the driver code, especially in some
places in the device transmit routine.

In addition, it allows us to more easily track the consumer
and producer indexes of a particular pool. This has been
further improved by better tracking of in-use buffers to
prevent possible data corruption in case an invalid buffer
entry is used.

v5: Fix bisectability mistake in the first patch. Removed
    TSO-specific data in a later patch when it is no longer used.

v4: Fix error in 7th patch that causes an oops by using
    the older fixed value for number of buffers instead
    of the respective field in the tx pool data structure

v3: Forgot to update TX pool cleaning function to handle new data
    structures. Included 7th patch for that.

v2: Fix typo in 3/6 commit subject line
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 20:12:40 -04:00
Thomas Falcon 76c15c911b ibmvnic: Remove unused TSO resources in TX pool structure
Finally, remove the TSO-specific fields in the TX pool
strcutures. These are no longer needed with the introduction
of separate buffer pools for TSO transmissions.

Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 20:12:39 -04:00
Thomas Falcon e9e1e97884 ibmvnic: Update TX pool cleaning routine
Update routine that cleans up any outstanding transmits that
have not received completions when the device needs to close.
Introduces a helper function that cleans one TX pool to make
code more readable.

Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 20:12:39 -04:00
Thomas Falcon 86b61a5f2e ibmvnic: Improve TX buffer accounting
Improve TX pool buffer accounting to prevent the producer
index from overruning the consumer. First, set the next free
index to an invalid value if it is in use. If next buffer
to be consumed is in use, drop the packet.

Finally, if the transmit fails for some other reason, roll
back the consumer index and set the free map entry to its original
value. This should also be done if the DMA map fails.

Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 20:12:39 -04:00
Thomas Falcon 06b3e35788 ibmvnic: Update TX and TX completion routines
Update TX and TX completion routines to account for TX pool
restructuring. TX routine first chooses the pool depending
on whether a packet is GSO or not, then uses it accordingly.

For the completion routine to know which pool it needs to use,
set the most significant bit of the correlator index to one
if the packet uses the TSO pool. On completion, unset the bit
and use the correlator index to release the buffer pool entry.

Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-17 20:12:39 -04:00