OpenCloudOS-Kernel/include/uapi/linux/neighbour.h

219 lines
5.7 KiB
C
Raw Normal View History

License cleanup: add SPDX license identifier to uapi header files with no license Many user space API headers are missing licensing information, which makes it hard for compliance tools to determine the correct license. By default are files without license information under the default license of the kernel, which is GPLV2. Marking them GPLV2 would exclude them from being included in non GPLV2 code, which is obviously not intended. The user space API headers fall under the syscall exception which is in the kernels COPYING file: NOTE! This copyright does *not* cover user programs that use kernel services by normal system calls - this is merely considered normal use of the kernel, and does *not* fall under the heading of "derived work". otherwise syscall usage would not be possible. Update the files which contain no license information with an SPDX license identifier. The chosen identifier is 'GPL-2.0 WITH Linux-syscall-note' which is the officially assigned identifier for the Linux syscall exception. SPDX license identifiers are a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. See the previous patch in this series for the methodology of how this patch was researched. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 22:08:43 +08:00
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
#ifndef __LINUX_NEIGHBOUR_H
#define __LINUX_NEIGHBOUR_H
#include <linux/types.h>
#include <linux/netlink.h>
struct ndmsg {
__u8 ndm_family;
__u8 ndm_pad1;
__u16 ndm_pad2;
__s32 ndm_ifindex;
__u16 ndm_state;
__u8 ndm_flags;
__u8 ndm_type;
};
enum {
NDA_UNSPEC,
NDA_DST,
NDA_LLADDR,
NDA_CACHEINFO,
NDA_PROBES,
NDA_VLAN,
NDA_PORT,
NDA_VNI,
NDA_IFINDEX,
NDA_MASTER,
NDA_LINK_NETNSID,
vxlan: support fdb and learning in COLLECT_METADATA mode Vxlan COLLECT_METADATA mode today solves the per-vni netdev scalability problem in l3 networks. It expects all forwarding information to be present in dst_metadata. This patch series enhances collect metadata mode to include the case where only vni is present in dst_metadata, and the vxlan driver can then use the rest of the forwarding information datbase to make forwarding decisions. There is no change to default COLLECT_METADATA behaviour. These changes only apply to COLLECT_METADATA when used with the bridging use-case with a special dst_metadata tunnel info flag (eg: where vxlan device is part of a bridge). For all this to work, the vxlan driver will need to now support a single fdb table hashed by mac + vni. This series essentially makes this happen. use-case and workflow: vxlan collect metadata device participates in bridging vlan to vn-segments. Bridge driver above the vxlan device, sends the vni corresponding to the vlan in the dst_metadata. vxlan driver will lookup forwarding database with (mac + vni) for the required remote destination information to forward the packet. Changes introduced by this patch: - allow learning and forwarding database state in vxlan netdev in COLLECT_METADATA mode. Current behaviour is not changed by default. tunnel info flag IP_TUNNEL_INFO_BRIDGE is used to support the new bridge friendly mode. - A single fdb table hashed by (mac, vni) to allow fdb entries with multiple vnis in the same fdb table - rx path already has the vni - tx path expects a vni in the packet with dst_metadata - prior to this series, fdb remote_dsts carried remote vni and the vxlan device carrying the fdb table represented the source vni. With the vxlan device now representing multiple vnis, this patch adds a src vni attribute to the fdb entry. The remote vni already uses NDA_VNI attribute. This patch introduces NDA_SRC_VNI netlink attribute to represent the src vni in a multi vni fdb table. iproute2 example (patched and pruned iproute2 output to just show relevant fdb entries): example shows same host mac learnt on two vni's. before (netdev per vni): $bridge fdb show | grep "00:02:00:00:00:03" 00:02:00:00:00:03 dev vxlan1001 dst 12.0.0.8 self 00:02:00:00:00:03 dev vxlan1000 dst 12.0.0.8 self after this patch with collect metadata in bridged mode (single netdev): $bridge fdb show | grep "00:02:00:00:00:03" 00:02:00:00:00:03 dev vxlan0 src_vni 1001 dst 12.0.0.8 self 00:02:00:00:00:03 dev vxlan0 src_vni 1000 dst 12.0.0.8 self Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-01 14:59:52 +08:00
NDA_SRC_VNI,
NDA_PROTOCOL, /* Originator of entry */
vxlan: ecmp support for mac fdb entries Todays vxlan mac fdb entries can point to multiple remote ips (rdsts) with the sole purpose of replicating broadcast-multicast and unknown unicast packets to those remote ips. E-VPN multihoming [1,2,3] requires bridged vxlan traffic to be load balanced to remote switches (vteps) belonging to the same multi-homed ethernet segment (E-VPN multihoming is analogous to multi-homed LAG implementations, but with the inter-switch peerlink replaced with a vxlan tunnel). In other words it needs support for mac ecmp. Furthermore, for faster convergence, E-VPN multihoming needs the ability to update fdb ecmp nexthops independent of the fdb entries. New route nexthop API is perfect for this usecase. This patch extends the vxlan fdb code to take a nexthop id pointing to an ecmp nexthop group. Changes include: - New NDA_NH_ID attribute for fdbs - Use the newly added fdb nexthop groups - makes vxlan rdsts and nexthop handling code mutually exclusive - since this is a new use-case and the requirement is for ecmp nexthop groups, the fdb add and update path checks that the nexthop is really an ecmp nexthop group. This check can be relaxed in the future, if we want to introduce replication fdb nexthop groups and allow its use in lieu of current rdst lists. - fdb update requests with nexthop id's only allowed for existing fdb's that have nexthop id's - learning will not override an existing fdb entry with nexthop group - I have wrapped the switchdev offload code around the presence of rdst [1] E-VPN RFC https://tools.ietf.org/html/rfc7432 [2] E-VPN with vxlan https://tools.ietf.org/html/rfc8365 [3] http://vger.kernel.org/lpc_net2018_talks/scaling_bridge_fdb_database_slidesV3.pdf Includes a null check fix in vxlan_xmit from Nikolay v2 - Fixed build issue: Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 13:26:14 +08:00
NDA_NH_ID,
NDA_FDB_EXT_ATTRS,
NDA_FLAGS_EXT,
NDA_NDM_STATE_MASK,
NDA_NDM_FLAGS_MASK,
__NDA_MAX
};
#define NDA_MAX (__NDA_MAX - 1)
/*
* Neighbor Cache Entry Flags
*/
net, neigh: Add NTF_MANAGED flag for managed neighbor entries Allow a user space control plane to insert entries with a new NTF_EXT_MANAGED flag. The flag then indicates to the kernel that the neighbor entry should be periodically probed for keeping the entry in NUD_REACHABLE state iff possible. The use case for this is targeting XDP or tc BPF load-balancers which use the bpf_fib_lookup() BPF helper in order to piggyback on neighbor resolution for their backends. Given they cannot be resolved in fast-path, a control plane inserts the L3 (without L2) entries manually into the neighbor table and lets the kernel do the neighbor resolution either on the gateway or on the backend directly in case the latter resides in the same L2. This avoids to deal with L2 in the control plane and to rebuild what the kernel already does best anyway. NTF_EXT_MANAGED can be combined with NTF_EXT_LEARNED in order to avoid GC eviction. The kernel then adds NTF_MANAGED flagged entries to a per-neighbor table which gets triggered by the system work queue to periodically call neigh_event_send() for performing the resolution. The implementation allows migration from/to NTF_MANAGED neighbor entries, so that already existing entries can be converted by the control plane if needed. Potentially, we could make the interval for periodically calling neigh_event_send() configurable; right now it's set to DELAY_PROBE_TIME which is also in line with mlxsw which has similar driver-internal infrastructure c723c735fa6b ("mlxsw: spectrum_router: Periodically update the kernel's neigh table"). In future, the latter could possibly reuse the NTF_MANAGED neighbors as well. Example: # ./ip/ip n replace 192.168.178.30 dev enp5s0 managed extern_learn # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a managed extern_learn REACHABLE [...] Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Roopa Prabhu <roopa@nvidia.com> Link: https://linuxplumbersconf.org/event/11/contributions/953/ Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-11 20:12:38 +08:00
#define NTF_USE (1 << 0)
#define NTF_SELF (1 << 1)
#define NTF_MASTER (1 << 2)
#define NTF_PROXY (1 << 3) /* == ATF_PUBL */
#define NTF_EXT_LEARNED (1 << 4)
#define NTF_OFFLOADED (1 << 5)
#define NTF_STICKY (1 << 6)
#define NTF_ROUTER (1 << 7)
/* Extended flags under NDA_FLAGS_EXT: */
#define NTF_EXT_MANAGED (1 << 0)
net: add generic PF_BRIDGE:RTM_ FDB hooks This adds two new flags NTF_MASTER and NTF_SELF that can now be used to specify where PF_BRIDGE netlink commands should be sent. NTF_MASTER sends the commands to the 'dev->master' device for parsing. Typically this will be the linux net/bridge, or open-vswitch devices. Also without any flags set the command will be handled by the master device as well so that current user space tools continue to work as expected. The NTF_SELF flag will push the PF_BRIDGE commands to the device. In the basic example below the commands are then parsed and programmed in the embedded bridge. Note if both NTF_SELF and NTF_MASTER bits are set then the command will be sent to both 'dev->master' and 'dev' this allows user space to easily keep the embedded bridge and software bridge in sync. There is a slight complication in the case with both flags set when an error occurs. To resolve this the rtnl handler clears the NTF_ flag in the netlink ack to indicate which sets completed successfully. The add/del handlers will abort as soon as any error occurs. To support this new net device ops were added to call into the device and the existing bridging code was refactored to use these. There should be no required changes in user space to support the current bridge behavior. A basic setup with a SR-IOV enabled NIC looks like this, veth0 veth2 | | ------------ | bridge0 | <---- software bridging ------------ / / ethx.y ethx VF PF \ \ <---- propagate FDB entries to HW \ \ -------------------- | Embedded Bridge | <---- hardware offloaded switching -------------------- In this case the embedded bridge must be managed to allow 'veth0' to communicate with 'ethx.y' correctly. At present drivers managing the embedded bridge either send frames onto the network which then get dropped by the switch OR the embedded bridge will flood these frames. With this patch we have a mechanism to manage the embedded bridge correctly from user space. This example is specific to SR-IOV but replacing the VF with another PF or dropping this into the DSA framework generates similar management issues. Examples session using the 'br'[1] tool to add, dump and then delete a mac address with a new "embedded" option and enabled ixgbe driver: # br fdb add 22:35:19:ac:60:59 dev eth3 # br fdb port mac addr flags veth0 22:35:19:ac:60:58 static veth0 9a:5f:81:f7:f6:ec local eth3 00:1b:21:55:23:59 local eth3 22:35:19:ac:60:59 static veth0 22:35:19:ac:60:57 static #br fdb add 22:35:19:ac:60:59 embedded dev eth3 #br fdb port mac addr flags veth0 22:35:19:ac:60:58 static veth0 9a:5f:81:f7:f6:ec local eth3 00:1b:21:55:23:59 local eth3 22:35:19:ac:60:59 static veth0 22:35:19:ac:60:57 static eth3 22:35:19:ac:60:59 local embedded #br fdb del 22:35:19:ac:60:59 embedded dev eth3 I added a couple lines to 'br' to set the flags correctly is all. It is my opinion that the merit of this patch is now embedded and SW bridges can both be modeled correctly in user space using very nearly the same message passing. [1] 'br' tool was published as an RFC here and will be renamed 'bridge' http://patchwork.ozlabs.org/patch/117664/ Thanks to Jamal Hadi Salim, Stephen Hemminger and Ben Hutchings for valuable feedback, suggestions, and review. v2: fixed api descriptions and error case with both NTF_SELF and NTF_MASTER set plus updated patch description. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-15 14:43:56 +08:00
/*
* Neighbor Cache Entry States.
*/
#define NUD_INCOMPLETE 0x01
#define NUD_REACHABLE 0x02
#define NUD_STALE 0x04
#define NUD_DELAY 0x08
#define NUD_PROBE 0x10
#define NUD_FAILED 0x20
/* Dummy states */
#define NUD_NOARP 0x40
#define NUD_PERMANENT 0x80
#define NUD_NONE 0x00
net, neigh: Add NTF_MANAGED flag for managed neighbor entries Allow a user space control plane to insert entries with a new NTF_EXT_MANAGED flag. The flag then indicates to the kernel that the neighbor entry should be periodically probed for keeping the entry in NUD_REACHABLE state iff possible. The use case for this is targeting XDP or tc BPF load-balancers which use the bpf_fib_lookup() BPF helper in order to piggyback on neighbor resolution for their backends. Given they cannot be resolved in fast-path, a control plane inserts the L3 (without L2) entries manually into the neighbor table and lets the kernel do the neighbor resolution either on the gateway or on the backend directly in case the latter resides in the same L2. This avoids to deal with L2 in the control plane and to rebuild what the kernel already does best anyway. NTF_EXT_MANAGED can be combined with NTF_EXT_LEARNED in order to avoid GC eviction. The kernel then adds NTF_MANAGED flagged entries to a per-neighbor table which gets triggered by the system work queue to periodically call neigh_event_send() for performing the resolution. The implementation allows migration from/to NTF_MANAGED neighbor entries, so that already existing entries can be converted by the control plane if needed. Potentially, we could make the interval for periodically calling neigh_event_send() configurable; right now it's set to DELAY_PROBE_TIME which is also in line with mlxsw which has similar driver-internal infrastructure c723c735fa6b ("mlxsw: spectrum_router: Periodically update the kernel's neigh table"). In future, the latter could possibly reuse the NTF_MANAGED neighbors as well. Example: # ./ip/ip n replace 192.168.178.30 dev enp5s0 managed extern_learn # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a managed extern_learn REACHABLE [...] Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Roopa Prabhu <roopa@nvidia.com> Link: https://linuxplumbersconf.org/event/11/contributions/953/ Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-11 20:12:38 +08:00
/* NUD_NOARP & NUD_PERMANENT are pseudostates, they never change and make no
* address resolution or NUD.
*
* NUD_PERMANENT also cannot be deleted by garbage collectors. This holds true
* for dynamic entries with NTF_EXT_LEARNED flag as well. However, upon carrier
* down event, NUD_PERMANENT entries are not flushed whereas NTF_EXT_LEARNED
* flagged entries explicitly are (which is also consistent with the routing
* subsystem).
*
* When NTF_EXT_LEARNED is set for a bridge fdb entry the different cache entry
* states don't make sense and thus are ignored. Such entries don't age and
* can roam.
net, neigh: Add NTF_MANAGED flag for managed neighbor entries Allow a user space control plane to insert entries with a new NTF_EXT_MANAGED flag. The flag then indicates to the kernel that the neighbor entry should be periodically probed for keeping the entry in NUD_REACHABLE state iff possible. The use case for this is targeting XDP or tc BPF load-balancers which use the bpf_fib_lookup() BPF helper in order to piggyback on neighbor resolution for their backends. Given they cannot be resolved in fast-path, a control plane inserts the L3 (without L2) entries manually into the neighbor table and lets the kernel do the neighbor resolution either on the gateway or on the backend directly in case the latter resides in the same L2. This avoids to deal with L2 in the control plane and to rebuild what the kernel already does best anyway. NTF_EXT_MANAGED can be combined with NTF_EXT_LEARNED in order to avoid GC eviction. The kernel then adds NTF_MANAGED flagged entries to a per-neighbor table which gets triggered by the system work queue to periodically call neigh_event_send() for performing the resolution. The implementation allows migration from/to NTF_MANAGED neighbor entries, so that already existing entries can be converted by the control plane if needed. Potentially, we could make the interval for periodically calling neigh_event_send() configurable; right now it's set to DELAY_PROBE_TIME which is also in line with mlxsw which has similar driver-internal infrastructure c723c735fa6b ("mlxsw: spectrum_router: Periodically update the kernel's neigh table"). In future, the latter could possibly reuse the NTF_MANAGED neighbors as well. Example: # ./ip/ip n replace 192.168.178.30 dev enp5s0 managed extern_learn # ./ip/ip n 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a managed extern_learn REACHABLE [...] Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Roopa Prabhu <roopa@nvidia.com> Link: https://linuxplumbersconf.org/event/11/contributions/953/ Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-11 20:12:38 +08:00
*
* NTF_EXT_MANAGED flagged neigbor entries are managed by the kernel on behalf
* of a user space control plane, and automatically refreshed so that (if
* possible) they remain in NUD_REACHABLE state.
*/
struct nda_cacheinfo {
__u32 ndm_confirmed;
__u32 ndm_used;
__u32 ndm_updated;
__u32 ndm_refcnt;
};
/*****************************************************************
* Neighbour tables specific messages.
*
* To retrieve the neighbour tables send RTM_GETNEIGHTBL with the
* NLM_F_DUMP flag set. Every neighbour table configuration is
* spread over multiple messages to avoid running into message
* size limits on systems with many interfaces. The first message
* in the sequence transports all not device specific data such as
* statistics, configuration, and the default parameter set.
* This message is followed by 0..n messages carrying device
* specific parameter sets.
* Although the ordering should be sufficient, NDTA_NAME can be
* used to identify sequences. The initial message can be identified
* by checking for NDTA_CONFIG. The device specific messages do
* not contain this TLV but have NDTPA_IFINDEX set to the
* corresponding interface index.
*
* To change neighbour table attributes, send RTM_SETNEIGHTBL
* with NDTA_NAME set. Changeable attribute include NDTA_THRESH[1-3],
* NDTA_GC_INTERVAL, and all TLVs in NDTA_PARMS unless marked
* otherwise. Device specific parameter sets can be changed by
* setting NDTPA_IFINDEX to the interface index of the corresponding
* device.
****/
struct ndt_stats {
__u64 ndts_allocs;
__u64 ndts_destroys;
__u64 ndts_hash_grows;
__u64 ndts_res_failed;
__u64 ndts_lookups;
__u64 ndts_hits;
__u64 ndts_rcv_probes_mcast;
__u64 ndts_rcv_probes_ucast;
__u64 ndts_periodic_gc_runs;
__u64 ndts_forced_gc_runs;
__u64 ndts_table_fulls;
};
enum {
NDTPA_UNSPEC,
NDTPA_IFINDEX, /* u32, unchangeable */
NDTPA_REFCNT, /* u32, read-only */
NDTPA_REACHABLE_TIME, /* u64, read-only, msecs */
NDTPA_BASE_REACHABLE_TIME, /* u64, msecs */
NDTPA_RETRANS_TIME, /* u64, msecs */
NDTPA_GC_STALETIME, /* u64, msecs */
NDTPA_DELAY_PROBE_TIME, /* u64, msecs */
NDTPA_QUEUE_LEN, /* u32 */
NDTPA_APP_PROBES, /* u32 */
NDTPA_UCAST_PROBES, /* u32 */
NDTPA_MCAST_PROBES, /* u32 */
NDTPA_ANYCAST_DELAY, /* u64, msecs */
NDTPA_PROXY_DELAY, /* u64, msecs */
NDTPA_PROXY_QLEN, /* u32 */
NDTPA_LOCKTIME, /* u64, msecs */
neigh: new unresolved queue limits Le mercredi 09 novembre 2011 à 16:21 -0500, David Miller a écrit : > From: David Miller <davem@davemloft.net> > Date: Wed, 09 Nov 2011 16:16:44 -0500 (EST) > > > From: Eric Dumazet <eric.dumazet@gmail.com> > > Date: Wed, 09 Nov 2011 12:14:09 +0100 > > > >> unres_qlen is the number of frames we are able to queue per unresolved > >> neighbour. Its default value (3) was never changed and is responsible > >> for strange drops, especially if IP fragments are used, or multiple > >> sessions start in parallel. Even a single tcp flow can hit this limit. > > ... > > > > Ok, I've applied this, let's see what happens :-) > > Early answer, build fails. > > Please test build this patch with DECNET enabled and resubmit. The > decnet neigh layer still refers to the removed ->queue_len member. > > Thanks. Ouch, this was fixed on one machine yesterday, but not the other one I used this morning, sorry. [PATCH V5 net-next] neigh: new unresolved queue limits unres_qlen is the number of frames we are able to queue per unresolved neighbour. Its default value (3) was never changed and is responsible for strange drops, especially if IP fragments are used, or multiple sessions start in parallel. Even a single tcp flow can hit this limit. $ arp -d 192.168.20.108 ; ping -c 2 -s 8000 192.168.20.108 PING 192.168.20.108 (192.168.20.108) 8000(8028) bytes of data. 8008 bytes from 192.168.20.108: icmp_seq=2 ttl=64 time=0.322 ms Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-09 20:07:14 +08:00
NDTPA_QUEUE_LENBYTES, /* u32 */
NDTPA_MCAST_REPROBES, /* u32 */
NDTPA_PAD,
NDTPA_INTERVAL_PROBE_TIME_MS, /* u64, msecs */
__NDTPA_MAX
};
#define NDTPA_MAX (__NDTPA_MAX - 1)
struct ndtmsg {
__u8 ndtm_family;
__u8 ndtm_pad1;
__u16 ndtm_pad2;
};
struct ndt_config {
__u16 ndtc_key_len;
__u16 ndtc_entry_size;
__u32 ndtc_entries;
__u32 ndtc_last_flush; /* delta to now in msecs */
__u32 ndtc_last_rand; /* delta to now in msecs */
__u32 ndtc_hash_rnd;
__u32 ndtc_hash_mask;
__u32 ndtc_hash_chain_gc;
__u32 ndtc_proxy_qlen;
};
enum {
NDTA_UNSPEC,
NDTA_NAME, /* char *, unchangeable */
NDTA_THRESH1, /* u32 */
NDTA_THRESH2, /* u32 */
NDTA_THRESH3, /* u32 */
NDTA_CONFIG, /* struct ndt_config, read-only */
NDTA_PARMS, /* nested TLV NDTPA_* */
NDTA_STATS, /* struct ndt_stats, read-only */
NDTA_GC_INTERVAL, /* u64, msecs */
NDTA_PAD,
__NDTA_MAX
};
#define NDTA_MAX (__NDTA_MAX - 1)
/* FDB activity notification bits used in NFEA_ACTIVITY_NOTIFY:
* - FDB_NOTIFY_BIT - notify on activity/expire for any entry
* - FDB_NOTIFY_INACTIVE_BIT - mark as inactive to avoid multiple notifications
*/
enum {
FDB_NOTIFY_BIT = (1 << 0),
FDB_NOTIFY_INACTIVE_BIT = (1 << 1)
};
/* embedded into NDA_FDB_EXT_ATTRS:
* [NDA_FDB_EXT_ATTRS] = {
* [NFEA_ACTIVITY_NOTIFY]
* ...
* }
*/
enum {
NFEA_UNSPEC,
NFEA_ACTIVITY_NOTIFY,
NFEA_DONT_REFRESH,
__NFEA_MAX
};
#define NFEA_MAX (__NFEA_MAX - 1)
#endif