Commit Graph

725710 Commits

Author SHA1 Message Date
Jesper Dangaard Brouer e2e3224122 samples/bpf: xdp2skb_meta comment explain why pkt-data pointers are invalidated
Improve the 'unknown reason' comment, with an actual explaination of why
the ctx pkt-data pointers need to be loaded after the helper function
bpf_xdp_adjust_meta().  Based on the explaination Daniel gave.

Fixes: 36e04a2d78 ("samples/bpf: xdp2skb_meta shows transferring info from XDP to SKB")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-18 01:49:09 +01:00
Daniel Borkmann cda18e9726 Merge branch 'bpf-dump-and-disasm-nfp-jit'
Jakub Kicinski says:

====================
Jiong says:

Currently bpftool could disassemble host jited image, for example x86_64,
using libbfd. However it couldn't disassemble offload jited image.

There are two reasons:

  1. bpf_obj_get_info_by_fd/struct bpf_prog_info couldn't get the address
     of jited image and image's length.

  2. Even after issue 1 resolved, bpftool couldn't figure out what is the
     offload arch from bpf_prog_info, therefore can't drive libbfd
     disassembler correctly.

  This patch set resolve issue 1 by introducing two new fields "jited_len"
and "jited_image" in bpf_dev_offload. These two fields serve as the generic
interface to communicate the jited image address and length for all offload
targets to higher level caller. For example, bpf_obj_get_info_by_fd could
use them to fill the userspace visible fields jited_prog_len and
jited_prog_insns.

  This patch set resolve issue 2 by getting bfd backend name through
"ifindex", i.e network interface index.

v1:
 - Deduct bfd arch name through ifindex, i.e network interface index.
   First, map ifindex to devname through ifindex_to_name_ns, then get
   pci id through /sys/class/dev/DEVNAME/device/vendor. (Daniel, Alexei)
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-18 01:26:16 +01:00
Jiong Wang e65935969d tools: bpftool: improve architecture detection by using ifindex
The current architecture detection method in bpftool is designed for host
case.

For offload case, we can't use the architecture of "bpftool" itself.
Instead, we could call the existing "ifindex_to_name_ns" to get DEVNAME,
then read pci id from /sys/class/dev/DEVNAME/device/vendor, finally we map
vendor id to bfd arch name which will finally be used to select bfd backend
for the disassembler.

Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-18 01:26:15 +01:00
Jiong Wang eb1d7db927 nfp: bpf: set new jit info fields
This patch set those new jit info fields introduced in this patch set.

Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-18 01:26:15 +01:00
Jiong Wang fcfb126def bpf: add new jited info fields in bpf_dev_offload and bpf_prog_info
For host JIT, there are "jited_len"/"bpf_func" fields in struct bpf_prog
used by all host JIT targets to get jited image and it's length. While for
offload, targets are likely to have different offload mechanisms that these
info are kept in device private data fields.

Therefore, BPF_OBJ_GET_INFO_BY_FD syscall needs an unified way to get JIT
length and contents info for offload targets.

One way is to introduce new callback to parse device private data then fill
those fields in bpf_prog_info. This might be a little heavy, the other way
is to add generic fields which will be initialized by all offload targets.

This patch follow the second approach to introduce two new fields in
struct bpf_dev_offload and teach bpf_prog_get_info_by_fd about them to fill
correct jited_prog_len and jited_prog_insns in bpf_prog_info.

Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-18 01:26:15 +01:00
David S. Miller 4f7d58517f linux-can-next-for-4.16-20180116
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEE4bay/IylYqM/npjQHv7KIOw4HPYFAlpeCCQTHG1rbEBwZW5n
 dXRyb25peC5kZQAKCRAe/sog7Dgc9ti9B/4jAoCYTuYXnkRvS34jdgQQV99QyC8M
 EpcAgzHo2kYNPDf5q8TEVheXxvA9XeGiA+TtjL9cNowxwMMtJev3/dmBtOmU6jek
 RgOsYlR8guxBHOx8pj1IMl1xPoWTLfg4Kv1qnpXx3zvhOP610G/edBBSZt659MGF
 SW6pBoNivbl+EYSM5x81QIfqJ4NlD5AKY4PeeSnGrnSthd8EFNp2zKkcY8nMOJ0D
 Gm2YyxGJXh+lHse965DQRNg+owZxIWyheQplulVrw9v34LOjbFko3Cd+D9KLW5MG
 LVVRJ3E0jm7W75AvxNcv2WP+lZVcDxXqsxFH0dP8WOJNZiKZeJ5aZfji
 =ROXk
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-next-for-4.16-20180116' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2018-01-16

this is a pull request for net-next/master consisting of 9 patches.

This is a series of patches, some of them initially by Franklin S Cooper
Jr, which was picked up by Faiz Abbas. Faiz Abbas added some patches
while working on this series, I contributed one as well.

The first two patches add support to CAN device infrastructure to limit
the bitrate of a CAN adapter if the used CAN-transceiver has a certain
maximum bitrate.

The remaining patches improve the m_can driver. They add support for
bitrate limiting to the driver, clean up the driver and add support for
runtime PM.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 16:08:25 -05:00
Luis de Bethencourt 5ef7e0ba10 vxlan: Fix trailing semicolon
The trailing semicolon is an empty statement that does no operation.
It is completely stripped out by the compiler. Removing it since it doesn't do
anything.

Fixes: 5f35227ea3 ("net: Generalize ndo_gso_check to ndo_features_check")
Signed-off-by: Luis de Bethencourt <luisbg@kernel.org>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 16:07:24 -05:00
Ganesh Goudar baf5086840 cxgb4: restructure VF mgmt code
restructure the code which adds support for configuring
PCIe VF via mgmt netdevice. which was added by
commit 7829451c69 ("cxgb4: Add control net_device for
configuring PCIe VF")

Original work by: Casey Leedom <leedom@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 16:04:48 -05:00
Kirill Tkhai 42157277af net: Remove spinlock from get_net_ns_by_id()
idr_find() is safe under rcu_read_lock() and
maybe_get_net() guarantees that net is alive.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 15:42:35 -05:00
Kirill Tkhai 0c06bea919 net: Fix possible race in peernet2id_alloc()
peernet2id_alloc() is racy without rtnl_lock() as refcount_read(&peer->count)
under net->nsid_lock does not guarantee, peer is alive:

rcu_read_lock()
peernet2id_alloc()                            ..
  spin_lock_bh(&net->nsid_lock)               ..
  refcount_read(&peer->count) (!= 0)          ..
  ..                                          put_net()
  ..                                            cleanup_net()
  ..                                              for_each_net(tmp)
  ..                                                spin_lock_bh(&tmp->nsid_lock)
  ..                                                __peernet2id(tmp, net) == -1
  ..                                                    ..
  ..                                                    ..
    __peernet2id_alloc(alloc == true)                   ..
  ..                                                    ..
rcu_read_unlock()                                       ..
..                                                synchronize_rcu()
..                                                kmem_cache_free(net)

After the above situation, net::netns_id contains id pointing to freed memory,
and any other dereferencing by the id will operate with this freed memory.

Currently, peernet2id_alloc() is used under rtnl_lock() everywhere except
ovs_vport_cmd_fill_info(), and this race can't occur. But peernet2id_alloc()
is generic interface, and better we fix it before someone really starts
use it in wrong context.

v2: Don't place refcount_read(&net->count) under net->nsid_lock
    as suggested by Eric W. Biederman <ebiederm@xmission.com>
v3: Rebase on top of net-next

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 15:42:35 -05:00
David S. Miller a29ae44c7a Merge branch 'tun-allow-to-attach-eBPF-filter'
Jason Wang says:

====================
tun: allow to attach eBPF filter

This series tries to implement eBPF socket filter for tun. This could
be used for implementing efficient virtio-net receive filter for
vhost-net.

Changes from V2:
- fix typo
- remove unnecessary double check
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 15:32:11 -05:00
Jason Wang aff3d70a07 tun: allow to attach ebpf socket filter
This patch allows userspace to attach eBPF filter to tun. This will
allow to implement VM dataplane filtering in a more efficient way
compared to cBPF filter by allowing either qemu or libvirt to
attach eBPF filter to tun.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 15:32:10 -05:00
Jason Wang cd5681d7d8 tuntap: rename struct tun_steering_prog to struct tun_prog
To be reused by other eBPF program other than queue selection.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 15:32:10 -05:00
David S. Miller ca46abd6f8 Merge branch 'net-sched-allow-qdiscs-to-share-filter-block-instances'
Jiri Pirko says:

====================
net: sched: allow qdiscs to share filter block instances

Currently the filters added to qdiscs are independent. So for example if you
have 2 netdevices and you create ingress qdisc on both and you want to add
identical filter rules both, you need to add them twice. This patchset
makes this easier and mainly saves resources allowing to share all filters
within a qdisc - I call it a "filter block". Also this helps to save
resources when we do offload to hw for example to expensive TCAM.

So back to the example. First, we create 2 qdiscs. Both will share
block number 22. "22" is just an identification:
$ tc qdisc add dev ens7 ingress_block 22 ingress
                        ^^^^^^^^^^^^^^^^
$ tc qdisc add dev ens8 ingress_block 22 ingress
                        ^^^^^^^^^^^^^^^^

If we don't specify "block" command line option, no shared block would
be created:
$ tc qdisc add dev ens9 ingress

Now if we list the qdiscs, we will see the block index in the output:

$ tc qdisc
qdisc ingress ffff: dev ens7 parent ffff:fff1 ingress_block 22
qdisc ingress ffff: dev ens8 parent ffff:fff1 ingress_block 22
qdisc ingress ffff: dev ens9 parent ffff:fff1

To make is more visual, the situation looks like this:

   ens7 ingress qdisc                 ens7 ingress qdisc
          |                                  |
          |                                  |
          +---------->  block 22  <----------+

Unlimited number of qdiscs may share the same block.

Note that this patchset introduces block sharing support also for clsact
qdisc:
$ tc qdisc add dev ens10 ingress_block 23 egress_block 24 clsact
$ tc qdisc show dev ens10
qdisc clsact ffff: dev ens10 parent ffff:fff1 ingress_block 23 egress_block 24

We can add filter using the block index:

$ tc filter add block 22 protocol ip pref 25 flower dst_ip 192.168.0.0/16 action drop

Note we cannot use the qdisc for filter manipulations of shared blocks:

$ tc filter add dev ens8 ingress protocol ip pref 1 flower dst_ip 192.168.100.2 action drop
Error: This filter block is shared. Please use the block index to manipulate the filters.

We will see the same output if we list filters for ingress qdisc of
ens7 and ens8, also for the block 22:

$ tc filter show block 22
filter block 22 protocol ip pref 25 flower chain 0
filter block 22 protocol ip pref 25 flower chain 0 handle 0x1
...

$ tc filter show dev ens7 ingress
filter block 22 protocol ip pref 25 flower chain 0
filter block 22 protocol ip pref 25 flower chain 0 handle 0x1
...

$ tc filter show dev ens8 ingress
filter block 22 protocol ip pref 25 flower chain 0
filter block 22 protocol ip pref 25 flower chain 0 handle 0x1
...

---
v10->v11:
- patch 2:
 - fixed error path when register_pernet_subsys fails pointed out by Cong
- patch 9:
 - rebased on top of the current net-next

v9->v10:
- patch 7:
 - fixed ifindex magic in the patch description
- userspace patches:
 - added manpages and patch descriptions

v8->v9:
- patch "net: sched: add rt netlink message type for block get" was
  removed, userspace check filter existence using qdisc dump

v7->v8:
- patch 7:
 - added comment to ifindex block magic
- patch 9:
 - new patch
- patch 10:
 - base this on the patch that introduces qdisc-generic block index
   attributes parsing/dumping
- patch 13:
 - rebased on top of current net-next

v6->v7:
- patch 1:
 - unsquashed shared block patch that was previously squashed by mistake
 - fixed error path in block create - freeing chain 0
- patch 2:
 - new patch - splitted from the previous one as it got accidentaly
   squashed in the rebasing process in the past
 - converted to idr extended
 - removed auto-generating of block indexes. Callers have to explicily
   tell that the block is shared by passing non-zero block index
 - fixed error path in block get ext - freeing chain 0
- patch 7:
 - changed extack message for block index handle as suggested by DaveA
 - added extack message when block index does not exist
 - the block ifindex magic is in define and change to 0xffffffff
   as suggested by Jamal
- patch 8:
 - new patch implementing RTM_GETBLOCK in order to query if the block
   with some index exists
- patch 9:
 - adjust to the core changes and check block index attributes for being 0

v5->v6:
- added patch 6 that introduces block handle

v4->v5:
- patch 5:
 - add tracking of binding of devs that are unable to offload and check
   that before block cbs call.

v3->v4:
- patch 1:
 - rebased on top of the current net-next
 - added some extack strings
- patch 3:
 - rebased on top of the current net-next
- patch 5:
 - propagate netdev_ops->ndo_setup_tc error up to tcf_block_offload_bind
   caller
- patch 7:
 - rebased on top of the current net-next

v2->v3:
- removed original patch 1, removing tp->q cls_bpf dependency. Fixed by
  Jakub in the meantime.
- patch 1:
 - rebased on top of the current net-next
- patch 5:
 - new patch
- patch 8:
 - removed "p_" prefix from block index function args
- patch 10:
 - add tc offload feature handling
====================

Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:53:58 -05:00
Jiri Pirko 4b23258d6a mlxsw: spectrum_acl: Pass mlxsw_sp_port down to ruleset bind/unbind ops
No need to convert from mlxsw_sp_port to net_device and back again.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:53:58 -05:00
Jiri Pirko 3aaff32304 mlxsw: spectrum_acl: Implement TC block sharing
Benefit from the prepared TC and in-driver ACL infrastructure and
introduce block sharing offload. For that, a new struct "block" is
introduced in spectrum_acl in order to hold a list of specific
block-port bindings.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:53:58 -05:00
Jiri Pirko 02caf4995a mlxsw: spectrum_acl: Don't store netdev and ingress for ruleset unbind
Instead, pass netdev and ingress flag to ruleset unbind op.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:53:57 -05:00
Jiri Pirko 9fe5fdf27e mlxsw: spectrum_acl: Reshuffle code around mlxsw_sp_acl_ruleset_create/destroy
In order to prepare for follow-up changes, make the bind/unbind helpers
very simple. That required move of ht insertion/removal and bind/unbind
calls into mlxsw_sp_acl_ruleset_create/destroy.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:53:57 -05:00
Jiri Pirko 51ab2994c3 net: sched: allow ingress and clsact qdiscs to share filter blocks
Benefit from the previously introduced shared filter blocks
infrastructure and allow ingress and clsact qdisc instances to share
filter blocks. The block index is coming from userspace as qdisc option.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:53:57 -05:00
Jiri Pirko d47a6b0e7c net: sched: introduce ingress/egress block index attributes for qdisc
Introduce two new attributes to be used for qdisc creation and dumping.
One for ingress block, one for egress block. Introduce a set of ops that
qdisc which supports block sharing would implement.

Passing block indexes in qdisc change is not supported yet and it is
checked and forbidded.

In future, these attributes are to be reused for specifying block
indexes for classes as well. As of this moment however, it is not
supported so a check is in place to forbid it.

Suggested-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:53:57 -05:00
Jiri Pirko 7960d1daf2 net: sched: use block index as a handle instead of qdisc when block is shared
As the tcm_ifindex with value TCM_IFINDEX_MAGIC_BLOCK is invalid ifindex,
use it to indicate that we work with block, instead of qdisc.
So if tcm_ifindex is set to TCM_IFINDEX_MAGIC_BLOCK, tcm_parent is used
to carry block_index.

If the block is set to be shared between at least 2 qdiscs, it is
forbidden to use the qdisc handle to add/delete filters. In that case,
userspace has to pass block_index.

Also, for dump of the filters, in case the block is shared in between at
least 2 qdiscs, the each filter is dumped with tcm_ifindex value
TCM_IFINDEX_MAGIC_BLOCK and tcm_parent set to block_index. That gives
the user clear indication, that the filter belongs to a shared block
and not only to one qdisc under which it is dumped.

Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:53:57 -05:00
Jiri Pirko caa7260156 net: sched: keep track of offloaded filters and check tc offload feature
During block bind, we need to check tc offload feature. If it is
disabled yet still the block contains offloaded filters, forbid the
bind. Also forbid to register callback for a block that already
contains offloaded filters, as the play back is not supported now.
For keeping track of offloaded filters there is a new counter
introduced, alongside with couple of helpers called from cls_* code.
These helpers set and clear TCA_CLS_FLAGS_IN_HW flag.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:53:57 -05:00
Jiri Pirko edf6711c98 net: sched: remove classid and q fields from tcf_proto
Both are no longer used, so remove them.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:53:56 -05:00
Jiri Pirko f36fe1c498 net: sched: introduce block mechanism to handle netif_keep_dst calls
Couple of classifiers call netif_keep_dst directly on q->dev. That is
not possible to do directly for shared blocke where multiple qdiscs are
owning the block. So introduce a infrastructure to keep track of the
block owners in list and use this list to implement block variant of
netif_keep_dst.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:53:56 -05:00
Jiri Pirko 9d3aaff3d8 net: sched: avoid usage of tp->q in tcf_classify
Use block index in the messages instead.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:53:56 -05:00
Jiri Pirko 4861738775 net: sched: introduce shared filter blocks infrastructure
Allow qdiscs to share filter blocks among them. Each qdisc type has to
use block get/put extended modifications that enable sharing.
Shared blocks are tracked within each net namespace and identified
by u32 index. This index is passed from user during the qdisc creation.
If user passes index that is not used by any other qdisc, new block
is created. If user passes index that is already used, the existing
block will be re-used.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:53:56 -05:00
Jiri Pirko a9b19443ed net: sched: introduce support for multiple filter chain pointers registration
So far, there was possible only to register a single filter chain
pointer to block->chain[0]. However, when the blocks will get shareable,
we need to allow multiple filter chain pointers registration.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:53:56 -05:00
David S. Miller c9a824210f Merge branch 'bnxt_en-next'
Michael Chan says:

====================
bnxt_en: Updates for net-next.

First, we upgrade the firmware interface spec.  Due to a change in
the toolchains, the auto-generated bnxt_hsi.h does not match the
old bnxt_hsi.h and the patch is really big.  This should be just
one-time.  Going forward, changes should be incremental.

The next 10 patches implement a new scheme for the PF and VF drivers
to allocate and reserve resources.  The new scheme is more flexible
and allows dynamic and asymmetric distribution of resources, whereas
the old scheme is static and even distribution.

The last few patches add cacheline size setting, a couple of PCI IDs,
better management of VF MAC address, and a better parent switchdev ID
for dual-port devices.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:48:27 -05:00
Sathya Perla dd4ea1da12 bnxt_en: export a common switchdev PARENT_ID for all reps of an adapter
Currently the driver exports different switchdev PARENT_IDs for
representors belonging to different SR-IOV PF-pools of an adapter.
This is not correct as the adapter can switch across all vports
of an adapter. This patch fixes this by exporting a common switchdev
PARENT_ID for all reps of an adapter. The PCIE DSN is used as the id.

Signed-off-by: Sathya Perla <sathya.perla@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:48:27 -05:00
Michael Chan c3480a6037 bnxt_en: Add cache line size setting to optimize performance.
The chip supports 64-byte and 128-byte cache line size for more optimal
DMA performance when matched to the CPU cache line size.  The default is 64.
If the system is using 128-byte cache line size, set it to 128.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:48:27 -05:00
Vasundhara Volam 91cdda4071 bnxt_en: Forward VF MAC address to the PF.
Forward hwrm_func_vf_cfg command from VF to PF driver, to store
VF MAC address in PF's context.  This will allow "ip link show"
to display all VF MAC addresses.

Maintain 2 locations of MAC address in VF info structure, one for
a PF assigned MAC and one for VF assigned MAC.

Display VF assigned MAC in "ip link show", only if PF assigned MAC is
not valid.

Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:48:27 -05:00
Vasundhara Volam 92abef361b bnxt_en: Add BCM5745X NPAR device IDs
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:48:26 -05:00
Michael Chan 8f23d638b3 bnxt_en: Expand bnxt_check_rings() to check all resources.
bnxt_check_rings() is called by ethtool, XDP setup, and ndo_setup_tc()
to see if there are enough resources to support the new configuration.
Expand the call to test all resources if the firmware supports the new
API.  With the more flexible resource allocation scheme, this call must
be made to check that all resources are available before committing to
allocate the resources.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:48:26 -05:00
Michael Chan 4673d66468 bnxt_en: Implement new method for the PF to assign SRIOV resources.
Instead of the old method of evenly dividing the resources to the VFs,
use the new firmware API to specify min and max resources for each VF.
This way, there is more flexibility for each VF to allocate more or less
resources.

The min is the absolute minimum for each VF to function.  The max is the
global resources minus the resources used by the PF.  Each VF is
guaranteed the min.  Up to max resources may be available for some VFs.

The PF driver can use one of 2 strategies specified in NVRAM to assign
the resources.  The old legacy strategy of evenly dividing the resources
or the new flexible strategy.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:48:26 -05:00
Michael Chan 6a1eef5b90 bnxt_en: Reserve resources for RFS.
In bnxt_rfs_capable(), add call to reserve vnic resources to support
NTUPLE.  Return true if we can successfully reserve enough vnics.
Otherwise, reserve the minimum 1 VNIC for normal operations not
supporting NTUPLE and return false.

Also, suppress warning message about not enough resources for NTUPLE when
only 1 RX ring is in use.  NTUPLE filters by definition require multiple
RX rings.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:48:26 -05:00
Michael Chan 674f50a5b0 bnxt_en: Implement new method to reserve rings.
The new method will call firmware to reserve the desired tx, rx, cmpl
rings, ring groups, stats context, and vnic resources.  A second query
call will check the actual resources that firmware is able to reserve.
The driver will then trim and adjust based on the actual resources
provided by firmware.  The driver will then reserve the final resources
in use.

This method is a more flexible way of using hardware resources.  The
resources are not fixed and can by adjusted by firmware.  The driver
adapts to the available resources that the firmware can reserve for
the driver.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:48:26 -05:00
Michael Chan 58ea801ac4 bnxt_en: Set initial default RX and TX ring numbers the same in combined mode.
In combined mode, the driver is currently not setting RX and TX ring
numbers the same when firmware can allocate more RX than TX or vice versa.
This will confuse the user as the ethtool convention assumes they are the
same in combined mode.  Fix it by adding bnxt_trim_dflt_sh_rings() to trim
RX and TX ring numbers to be the same as the completion ring number in
combined mode.

Note that if TCs are enabled and/or XDP is enabled, the number of TX rings
will not be the same as RX rings in combined mode.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:48:26 -05:00
Michael Chan be0dd9c410 bnxt_en: Add the new firmware API to query hardware resources.
The new API HWRM_FUNC_RESOURCE_QCAPS provides min and max hardware
resources.  Use the new API when it is supported by firmware.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:48:26 -05:00
Michael Chan 6a4f294705 bnxt_en: Refactor hardware resource data structures.
In preparation for new firmware APIs to allocate hardware resources,
add a new struct bnxt_hw_resc to hold various min, max and reserved
resources.  This new structure is common for PFs and VFs.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:48:26 -05:00
Michael Chan 80fcaf46c0 bnxt_en: Restore MSIX after disabling SRIOV.
After SRIOV has been enabled and disabled, the MSIX vectors assigned to
the VFs have to be re-initialized.  Otherwise they cannot be re-used by
the PF.  For example, increasing the number of PF rings after disabling
SRIOV may fail if the PF uses MSIX vectors previously assigned to the VFs.

To fix this, we add logic in bnxt_restore_pf_fw_resources() to close the
NIC, clear and re-init MSIX, and re-open the NIC.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:48:25 -05:00
Michael Chan 86e953db01 bnxt_en: Refactor bnxt_close_nic().
Add a new __bnxt_close_nic() function to do all the work previously done
in bnxt_close_nic() except waiting for SRIOV configuration.  The new
function will be used in the next patch as part of SRIOV cleanup.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:48:25 -05:00
Michael Chan 894aa69a90 bnxt_en: Update firmware interface to 1.9.0.
The version has new firmware APIs to allocate PF/VF resources more
flexibly.

New toolchains were used to generate this file, resulting in a one-time
large diffstat.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:48:25 -05:00
David S. Miller ee81098efe Merge branch 'dwmac-meson8b-clock-fixes-for-Meson8b'
Martin Blumenstingl says:

====================
dwmac-meson8b: clock fixes for Meson8b

this series is now successfully tested, thus we think it's ready to be
applied to your net-next tree.

Emiliano reported [0] that he couldn't get dwmac-meson8b to work on his
Odroid-C1. This is the (hopefully) final version of this series, which
was successfully tested.

Due to the fact that the public S805/S905/S912 datasheets all seem to
be outdated regarding the description of the PRG_ETH0 (also called
PRG_ETHERNET_ADDR0) register Linus Lüssing offered to help testing with
an oscilloscope and an Odroid-C1. I would like to say HUGE thanks to him
at this point as he spent hours figuring out the effects of the bits
that are (though to be) relevant to get Ethernet working on the
Odroid-C1.
We tested three scenarios, all based on version 3 of this series:
1) MPLL2 at ~500MHz, m250_div set to 1, bit 10 enabled
this resulted in a clock rate twice as high as expected at the RGMII TX
clock pin (250MHz instead of 125MHz for Gbit connections and 50MHz
instead of 25MHz for 100Mbit/s connections). it did not change the
rate at the XTAL_IN pin of PHY (which stayed consistenly at 25MHz)
2) MPLL2 at ~250MHz, m250_div set to 1, bit 10 disabled
the oscilloscope shows "no clock" for the RGMII TX clock pin at it's
highest resolution (and random rates at lower resolutions). XTAL_IN is
still at 25MHz
3) MPLL2 at ~250MHz, m250_div set to 1, bit 10 enabled
this resulted in a 125MHz signal at the RGMII TX clock pin for Gbit
speeds and 25MHz for 100Mbit/s - both values are as expected. The rate
on the XTAL_IN pin was at 25MHz
-> boot-logs (with the PRG_ETH0 register value) and screenshots from the
readings of the oscilloscope can be found at:
https://metameute.de/~tux/linux/amlogic/odroidc1/ethernet/

Version 4 of this series is based on the results from Linus Lüssing's
help with the oscilloscope and Odroid-C1.
Unfortunately I don't have any Meson8b boards with RGMII PHY so I could
only partially test this. @Emiliano: Could you please give this version
a try and let me know about the results (preferably with a "Tested-by"
if it works)?
You obviously still need your two "ARM: dts: meson8b" patches which
- add the amlogic,meson8b-dwmac" compatible to meson8b.dtsi
- enable Ethernet on the Odroid-C1 (according to your last thest a TX
  delay of 4ns is required to make it work properly)

When testing on Meson8b this also needs a fix for the MPLL clock driver:
"clk: meson: mpll: use 64-bit maths in params_from_rate", see:
https://patchwork.kernel.org/patch/10131677/

I have tested this myself on a Khadas VIM (GXL SoC, internal RMII PHY)
and a Khadas VIM2 (GXM SoC, external RGMII PHY). Both are still working
fine (so let's hope that this also fixes your Meson8b issue :)).

changes since v4 at [4]:
- dropped "RFT" status since Jerome tested this series successfully!
- dropped PATCH #2 ("simplify generating the clock names"). I will
  improve the whole clock registration in a separate series. since that
  patch didn't really improve anything I dropped it for now
- added Jerome's Acked-/Reviewed-/Tested-by's - many thanks!

changes since v3 at [3]:
- renamed the function PATCH #1 from meson8b_init_rgmii_clk to
  meson8b_init_rgmii_tx_clk since we now know what the register bits
  mean
- rewrote PATCH #3 because bit 10 is a gate clock and it seems that
  there is an internal fixed divide-by-2 clock. see the patch
  description for a detailed explanation
- updated the description of PATCH #4 and #5 as the clock we're trying
  to fix is the "RGMII TX" clock (old version stated that this is the
  "RGMII clock" or "PHY reference clock"). also updated the numbers in
  the description now that we have the clock hierarchy right (at least
  we hope so)

changes since v2 at [2]:
- added PATCH #2 to make the following patch easier
- Emiliano reported that there's currently another bug in the
  dwmac-meson8b driver which prevents it from working with RGMII PHYs on
  Meson8b: bit 10 of the PRG_ETH0 register is configures a clock gate
  (instead of a divide by 5 or divide by 10 clock divider). This has not
  been visible on GXBB and later due to the input clock which always led
  to a selection of "divide by 10" (which is done internally in the IP
  block, but the bit actually means "enable RGMII clock output").
  PATCH #3 was added to address this issue.
- the commit message of PATCH #4 and #5 (formerly PATCH #2 and #3) were
  updated and the patch itself rebased because the m25_div clock was
  removed with the new PATCH #3 (so some of the statements were not
  valid anymore)

changes since v1 at [1]:
- changed the subject of the cover-letter to indicate that this is all
  about the RGMII clock
- added PATCH #1 which ensures that we don't unnecessarily change the
  parent clocks in RMII mode (and also makes the code easier to
  understand)
- changed subject of PATCH #2 (formerly PATCH #1) to state that this
  is about the RGMII clock
- added Jerome's Reviewed-by to PATCH #2 (formerly PATCH #1)
- replaced PATCH #3 (formerly PATCH #2) with one that sets
  CLK_SET_RATE_PARENT on the mux and thus re-configures the MPLL2 clock
  on Meson8b correctly

[0] http://lists.infradead.org/pipermail/linux-amlogic/2017-December/005596.html
[1] http://lists.infradead.org/pipermail/linux-amlogic/2017-December/005848.html
[2] http://lists.infradead.org/pipermail/linux-amlogic/2017-December/005861.html
[3] http://lists.infradead.org/pipermail/linux-amlogic/2017-December/005899.html
[4] http://lists.infradead.org/pipermail/linux-amlogic/2018-January/006125.html
====================

Tested-by: Emiliano Ingrassia <ingrassia@epigenesys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:41:06 -05:00
Martin Blumenstingl fb7d38a70e net: stmmac: dwmac-meson8b: propagate rate changes to the parent clock
On Meson8b the only valid input clock is MPLL2. The bootloader
configures that to run at 500002394Hz which cannot be divided evenly
down to 125MHz using the m250_div clock. Currently the common clock
framework chooses a m250_div of 2 - with the internal fixed
"divide by 10" this results in a RGMII TX clock of 125001197Hz (120Hz
above the requested 125MHz).

Letting the common clock framework propagate the rate changes up to the
parent of m250_mux allows us to get the best possible clock rate. With
this patch the common clock framework calculates a rate of
very-close-to-250MHz (249999701Hz to be exact) for the MPLL2 clock
(which is the mux input). Dividing that by 2 (which is an internal,
fixed divider for the RGMII TX clock) gives us an RGMII TX clock of
124999850Hz (which is only 150Hz off the requested 125MHz, compared to
1197Hz based on the MPLL2 rate set by u-boot and the Amlogic GPL kernel
sources).

SoCs from the Meson GX series are not affected by this change because
the input clock is FCLK_DIV2 whose rate cannot be changed (which is fine
since it's running at 1GHz, so it's already a multiple of 250MHz and
125MHz).

Fixes: 566e825162 ("net: stmmac: add a glue driver for the Amlogic Meson 8b / GXBB DWMAC")
Suggested-by: Jerome Brunet <jbrunet@baylibre.com>
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Jerome Brunet <jbrunet@baylibre.com>
Tested-by: Jerome Brunet <jbrunet@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:41:05 -05:00
Martin Blumenstingl 433c6cab9d net: stmmac: dwmac-meson8b: fix setting the RGMII TX clock on Meson8b
Meson8b only supports MPLL2 as clock input. The rate of the MPLL2 clock
set by Odroid-C1's u-boot is close to (but not exactly) 500MHz. The
exact rate is 500002394Hz, which is calculated in
drivers/clk/meson/clk-mpll.c using the following formula:
DIV_ROUND_UP_ULL((u64)parent_rate * SDM_DEN, (SDM_DEN * n2) + sdm)
Odroid-C1's u-boot configures MPLL2 with the following values:
- SDM_DEN = 16384
- SDM = 1638
- N2 = 5

The 250MHz clock (m250_div) inside dwmac-meson8b driver is derived from
the MPLL2 clock. Due to MPLL2 running slightly faster than 500MHz the
common clock framework chooses a divider which is too big to generate
the 250MHz clock (a divider of 2 would be needed, but this is rounded up
to a divider of 3). This breaks the RTL8211F RGMII PHY on Odroid-C1
because it requires a (close to) 125MHz RGMII TX clock (on Gbit speeds,
the IP block internally divides that down to 25MHz on 100Mbit/s
connections and 2.5MHz on 10Mbit/s connections - we don't need any
special configuration for that).

Round the divider to the closest value to prevent this issue on Meson8b.
This means we'll now end up with a clock rate for the RGMII TX clock of
125001197Hz (= 125MHz plus 1197Hz), which is close-enough to 125MHz.
This has no effect on the Meson GX SoCs since there fclk_div2 is used as
input clock, which has a rate of 1000MHz (and thus is divisible cleanly
to 250MHz and 125MHz).

Fixes: 566e825162 ("net: stmmac: add a glue driver for the Amlogic Meson 8b / GXBB DWMAC")
Reported-by: Emiliano Ingrassia <ingrassia@epigenesys.com>
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Reviewed-by: Jerome Brunet <jbrunet@baylibre.com>
Tested-by: Jerome Brunet <jbrunet@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:41:05 -05:00
Martin Blumenstingl 4f6a71b84e net: stmmac: dwmac-meson8b: fix internal RGMII clock configuration
Tests (using an oscilloscope and an Odroid-C1 board with a RTL8211F
RGMII PHY) have shown that the PRG_ETH0 register behaves as follows:
- bit 4 is a mux to choose between two parent clocks. according to the
  public S805 datasheet the only supported parent clock is MPLL2 (this
  was not verified using the oscilloscope).
  The public S805/S905 datasheet claims that this bit is reserved.
- bits 9:7 control a one-based divider (register value 1 means "divide
  by 1", etc.) for the input clock. we call this clock the "m250_div"
  clock because it's value is always supposed to be (close to) 250MHz
  (see below for an explanation).
  The description in the public S805/S905 datasheet is a bit cryptic,
  but it comes down to "input clock = 250MHz * value" (which could also
  be expressed as "250MHz = input clock / value")
- there seems to be an internal fixed divide-by-2 clock which takes the
  output from the m250_div and divides it by 2. This is not unusual on
  Amlogic SoCs, since the SDIO (MMC) driver also uses an internal fixed
  divide-by-2 clock.
  This is not documented in the public S805/S905 datasheet
- bit 10 controls a gate clock which enables or disables the RGMII TX
  clock (which is an output on the MAC/SoC and an input in the PHY). we
  call this the "rgmii_tx_en" clock. if this bit is set to "0" the RGMII
  TX clock output is close to 0
  The description for this bit in the public S805/S905 datasheet is
  "Generate 25MHz clock for PHY". Based on these tests it's believed
  that this is wrong, and should probably read "Generate the 125MHz
  RGMII TX clock for the PHY"
- the RGMII TX clock has to be set to 125MHz - the IP block adjusts the
  output (automatically) depending on the line speed (RGMII specifies
  that Gbit connections use a 125MHz clock, 100Mbit/s connections use a
  25MHz clock and 10Mbit/s connections use a 2.5MHz clock. only Gbit and
  100Mbit/s were tested with an oscilloscope). Due to the requirement
  that this clock always has to be set to 125MHz and due to the fixed
  divide-by-2 parent clock this means that m250_div will always end up
  with a rate of (close to) 250MHz.
- bits 6:5 are the TX delay, which is also named "clock phase" in some
  of Amlogic's older GPL kernel sources.

The PHY also has an XTAL_IN pin where a 25MHz clock has to be provided.
Tests with the oscilloscope have shown that this is routed to a crystal
right next to the RTL8211F PHY. The same seems to be true on the Khadas
VIM2 (which uses a GXM SoC) board - however the 25MHz crystal is on the
other side of the PCB there.

This updates the clocks in the dwmac-meson8b driver by replacing the
"m25_div" with the "rgmii_tx_en" clock and additionally introducing a
fixed divide-by-2 clock between "m250_div" and "rgmii_tx_en".
Now we also need to set a frequency of 125MHz on the RGMII clock
(opposed to the 25MHz we set before, with that non-existing
divide-by-5-or-10 divider).

Special thanks go to Linus Lüssing for testing the various bits and
checking the results with an oscilloscope on his Odroid-C1!

Fixes: 566e825162 ("net: stmmac: add a glue driver for the Amlogic Meson 8b / GXBB DWMAC")
Reported-by: Emiliano Ingrassia <ingrassia@epigenesys.com>
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Acked-by: Jerome Brunet <jbrunet@baylibre.com>
Tested-by: Jerome Brunet <jbrunet@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:41:05 -05:00
Martin Blumenstingl 37512b42f0 net: stmmac: dwmac-meson8b: only configure the clocks in RGMII mode
Neither the m25_div_clk nor the m250_div_clk or m250_mux_clk are used in
RMII mode. The m25_div_clk output is routed to the RGMII PHY's "RGMII
clock".
This means that we don't need to configure the clocks in RMII mode. The
driver however did this - with no effect since the clocks are not routed
to the PHY in RMII mode.

While here also rename meson8b_init_clk to meson8b_init_rgmii_tx_clk to
make it easier to understand the code.

Fixes: 566e825162 ("net: stmmac: add a glue driver for the Amlogic Meson 8b / GXBB DWMAC")
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Tested-by: Jerome Brunet <jbrunet@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:41:05 -05:00
Jakub Kicinski 416ef9b15c net: sched: red: don't reset the backlog on every stat dump
Commit 0dfb33a0d7 ("sch_red: report backlog information") copied
child's backlog into RED's backlog.  Back then RED did not maintain
its own backlog counts.  This has changed after commit 2ccccf5fb4
("net_sched: update hierarchical backlog too") and commit d7f4f332f0
("sch_red: update backlog as well").  Copying is no longer necessary.

Tested:

$ tc -s qdisc show dev veth0
qdisc red 1: root refcnt 2 limit 400000b min 30000b max 30000b ecn
 Sent 20942 bytes 221 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 1260b 14p requeues 14
  marked 0 early 0 pdrop 0 other 0
qdisc tbf 2: parent 1: rate 1Kbit burst 15000b lat 3585.0s
 Sent 20942 bytes 221 pkt (dropped 0, overlimits 138 requeues 0)
 backlog 1260b 14p requeues 14

Recently RED offload was added.  We need to make sure drivers don't
depend on resetting the stats.  This means backlog should be treated
like any other statistic:

  total_stat = new_hw_stat - prev_hw_stat;

Adjust mlxsw.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Nogah Frankel <nogahf@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 14:29:32 -05:00
Saeed Mahameed 2d83619de8 net/mlx5: Fix build break
The latest merge between net and net-next introduced a complier assert in
mlx5 driver.  In hca_cap_bits older fields are kept along with newer
fields that should have replaced them.

Fixes: c02b3741eb ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 11:24:28 -05:00
David S. Miller c02b3741eb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Overlapping changes all over.

The mini-qdisc bits were a little bit tricky, however.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 00:10:42 -05:00