OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
Colin Ian King	5c17a07cff	net: dsa: bcm_sf2: remove redundant variable off Variable 'off' is being assigned but is never used hence it is redundant and can be removed. Cleans up clang warning: warning: variable 'off' set but not used [-Wunused-but-set-variable] Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 22:39:10 +09:00
David S. Miller	d11a899ff0	Merge branch 'Scheduled-packet-Transmission-ETF' Jesus Sanchez-Palencia says: ==================== Scheduled packet Transmission: ETF Changes since v1: - moved struct sock_txtime from socket.h to uapi net_tstamp.h; - sk_clockid was changed from u16 to u8; - sk_txtime_flags was changed from u16 to a u8 bit field in struct sock; - the socket option flags are now validated in sock_setsockopt(); - added SO_EE_ORIGIN_TXTIME; - sockc.transmit_time is now initialized from all IPv4 Tx paths; - added support for the IPv6 Tx path; Overview ======== This work consists of a set of kernel interfaces that can be used by applications that require (time-based) Scheduled Tx of packets. It is comprised by 3 new components to the kernel: - SO_TXTIME: socket option + cmsg programming interfaces. - etf: the "earliest txtime first" qdisc, that provides per-queue TxTime-based scheduling. This has been renamed from 'tbs' to 'etf' to better describe its functionality. - taprio: the "time-aware priority scheduler" qdisc, that provides per-port Time-Aware scheduling; This patchset is providing the first 2 components, which have been developed for longer. The taprio qdisc will be shared as an RFC separately (shortly). Note that this series is a follow up of the "Time based packet transmission" RFCv3 [1]. etf (formerly known as 'tbs') ============================= For applications/systems that the concept of time slices isn't precise enough, the etf qdisc allows applications to control the instant when a packet should leave the network controller. When used in conjunction with taprio, it can also be used in case the application needs to control with greater guarantee the offset into each time slice a packet will be sent. Another use case of etf, is when only a small number of applications on a system are time sensitive, so it can then be used with a more traditional root qdisc (like mqprio). The etf qdisc is designed so it buffers packets until a configurable time before their deadline (Tx time). The qdisc uses a rbtree internally so the buffered packets are always 'ordered' by their txtime (deadline) and will be dequeued following the earliest txtime first. It relies on the SO_TXTIME API set for receiving the per-packet timestamp (txtime) as well as the config flags for each socket: the clockid to be used as a reference, if the expected mode of txtime for that socket is deadline or strict mode, and if packet drops should be reported on the socket's error queue or not. The qdisc will drop any packets with a Tx time in the past, or if a packet expires while waiting for being dequeued. Drops can be reported as errors back to userspace through the socket's error queue. Example configuration: $ tc qdisc add dev enp2s0 parent 100:1 etf offload delta 200000 \ clockid CLOCK_TAI Here, the Qdisc will use HW offload for the txtime control. Packets will be dequeued by the qdisc "delta" (200000) nanoseconds before their transmission time. Because this will be using HW offload and since dynamic clocks are not supported by hrtimers, the system clock and the PHC clock must be synchronized for this mode to behave as expected. A more complete example can be found here, with instructions of how to test it: https://gist.github.com/jeez/bd3afeff081ba64a695008dd8215866f [2] Note that we haven't modified the qdisc so it uses a timerqueue because the modification needed was increasing the number of cachelines of a sk_buff. This series is also hosted on github and can be found at [3]. The companion iproute2 patches can be found at [4]. [1] https://patchwork.ozlabs.org/cover/882342/ [2] github doesn't make it clear, but the gist can be cloned like this: $ git clone https://gist.github.com/jeez/bd3afeff081ba64a695008dd8215866f scheduled-tx-tests [3] https://github.com/jeez/linux/tree/etf-v2 [4] https://github.com/jeez/iproute2/tree/etf-v2 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 22:30:28 +09:00
Jesus Sanchez-Palencia	4b15c70753	net/sched: Make etf report drops on error_queue Use the socket error queue for reporting dropped packets if the socket has enabled that feature through the SO_TXTIME API. Packets are dropped either on enqueue() if they aren't accepted by the qdisc or on dequeue() if the system misses their deadline. Those are reported as different errors so applications can react accordingly. Userspace can retrieve the errors through the socket error queue and the corresponding cmsg interfaces. A struct sock_extended_err* is used for returning the error data, and the packet's timestamp can be retrieved by adding both ee_data and ee_info fields as e.g.: ((__u64) serr->ee_data << 32) + serr->ee_info This feature is disabled by default and must be explicitly enabled by applications. Enabling it can bring some overhead for the Tx cycles of the application. Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 22:30:28 +09:00
Jesus Sanchez-Palencia	3048cf84d1	igb: Add support for ETF offload Implement HW offload support for SO_TXTIME through igb's Launchtime feature. This is done by extending igb_setup_tc() so it supports TC_SETUP_QDISC_ETF and configuring i210 so time based transmit arbitration is enabled. The FQTSS transmission mode added before is extended so strict priority (SP) queues wait for stream reservation (SR) ones. igb_config_tx_modes() is extended so it can support enabling/disabling Launchtime following the previous approach used for the credit-based shaper (CBS). As the previous flow, FQTSS transmission mode is enabled automatically by the driver once Launchtime (or CBS, as before) is enabled. Similarly, it's automatically disabled when the feature is disabled for the last queue that had it setup on. The driver just consumes the transmit times from the skbuffs directly, so no special handling is done in case an 'invalid' time is provided. We assume this has been handled by the ETF qdisc already. Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 22:30:28 +09:00
Jesus Sanchez-Palencia	1b9231e7e1	igb: Only call skb_tx_timestamp after descriptors are ready Currently, skb_tx_timestamp() is being called before the Tx descriptors are prepared in igb_xmit_frame_ring(), which happens during either the igb_tso() or igb_tx_csum() calls. Given that now the skb->tstamp might be used to carry the timestamp for SO_TXTIME, we must only call skb_tx_timestamp() after the information has been copied into the Tx descriptors. Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 22:30:28 +09:00
Jesus Sanchez-Palencia	8080e6ab4e	igb: Refactor igb_offload_cbs() Split code into a separate function (igb_offload_apply()) that will be used by ETF offload implementation. Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 22:30:28 +09:00
Jesus Sanchez-Palencia	0364a0d0e7	igb: Only change Tx arbitration when CBS is on Currently the data transmission arbitration algorithm - DataTranARB field on TQAVCTRL reg - is always set to CBS when the Tx mode is changed from legacy to 'Qav' mode. Make that configuration a bit more granular in preparation for the upcoming Launchtime enabling patches, since CBS and Launchtime can be enabled separately. That is achieved by moving the DataTranARB setup to igb_config_tx_modes() instead. Similarly, when disabling CBS we must check if it has been disabled for all queues, and clear the DataTranARB accordingly. Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 22:30:28 +09:00
Jesus Sanchez-Palencia	91db364236	igb: Refactor igb_configure_cbs() Make this function retrieve what it needs from the Tx ring being addressed since it already relies on what had been saved on it before. Also, since this function will be used by the upcoming Launchtime patches rename it to better reflect its intention. Note that Launchtime is not part of what 802.1Qav specifies, but the i210 datasheet refers to this set of functionality as "Qav Transmission Mode". Here we also perform a tiny refactor at is_any_cbs_enabled(), and add further documentation to igb_setup_tx_mode(). Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 22:30:27 +09:00
Jesus Sanchez-Palencia	88cab77162	net/sched: Add HW offloading capability to ETF Add infra so etf qdisc supports HW offload of time-based transmission. For hw offload, the time sorted list is still used, so packets are dequeued always in order of txtime. Example: $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0 $ tc qdisc add dev enp2s0 parent 100:1 etf offload delta 100000 \ clockid CLOCK_REALTIME In this example, the Qdisc will use HW offload for the control of the transmission time through the network adapter. The hrtimer used for packets scheduling inside the qdisc will use the clockid CLOCK_REALTIME as reference and packets leave the Qdisc "delta" (100000) nanoseconds before their transmission time. Because this will be using HW offload and since dynamic clocks are not supported by the hrtimer, the system clock and the PHC clock must be synchronized for this mode to behave as expected. Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 22:30:27 +09:00
Vinicius Costa Gomes	25db26a913	net/sched: Introduce the ETF Qdisc The ETF (Earliest TxTime First) qdisc uses the information added earlier in this series (the socket option SO_TXTIME and the new role of sk_buff->tstamp) to schedule packets transmission based on absolute time. For some workloads, just bandwidth enforcement is not enough, and precise control of the transmission of packets is necessary. Example: $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0 $ tc qdisc add dev enp2s0 parent 100:1 etf delta 100000 \ clockid CLOCK_TAI In this example, the Qdisc will provide SW best-effort for the control of the transmission time to the network adapter, the time stamp in the socket will be in reference to the clockid CLOCK_TAI and packets will leave the qdisc "delta" (100000) nanoseconds before its transmission time. The ETF qdisc will buffer packets sorted by their txtime. It will drop packets on enqueue() if their skbuff clockid does not match the clock reference of the Qdisc. Moreover, on dequeue(), a packet will be dropped if it expires while being enqueued. The qdisc also supports the SO_TXTIME deadline mode. For this mode, it will dequeue a packet as soon as possible and change the skb timestamp to 'now' during etf_dequeue(). Note that both the qdisc's and the SO_TXTIME ABIs allow for a clockid to be configured, but it's been decided that usage of CLOCK_TAI should be enforced until we decide to allow for other clockids to be used. The rationale here is that PTP times are usually in the TAI scale, thus no other clocks should be necessary. For now, the qdisc will return EINVAL if any clocks other than CLOCK_TAI are used. Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 22:30:27 +09:00
Vinicius Costa Gomes	860b642b9c	net/sched: Allow creating a Qdisc watchdog with other clocks This adds 'qdisc_watchdog_init_clockid()' that allows a clockid to be passed, this allows other time references to be used when scheduling the Qdisc to run. Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 22:30:27 +09:00
Richard Cochran	3d0ba8c03c	net: packet: Hook into time based transmission. For raw layer-2 packets, copy the desired future transmit time from the CMSG cookie into the skb. Signed-off-by: Richard Cochran <rcochran@linutronix.de> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 22:30:27 +09:00
Jesus Sanchez-Palencia	a818f75e31	net: ipv6: Hook into time based transmission Add a struct sockcm_cookie parameter to ip6_setup_cork() so we can easily re-use the transmit_time field from struct inet_cork for most paths, by copying the timestamp from the CMSG cookie. This is later copied into the skb during __ip6_make_skb(). For the raw fast path, also pass the sockcm_cookie as a parameter so we can just perform the copy at rawv6_send_hdrinc() directly. Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 22:30:27 +09:00
Jesus Sanchez-Palencia	bc969a9778	net: ipv4: Hook into time based transmission Add a transmit_time field to struct inet_cork, then copy the timestamp from the CMSG cookie at ip_setup_cork() so we can safely copy it into the skb later during __ip_make_skb(). For the raw fast path, just perform the copy at raw_send_hdrinc(). Signed-off-by: Richard Cochran <rcochran@linutronix.de> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 22:30:27 +09:00
Richard Cochran	80b14dee2b	net: Add a new socket option for a future transmit time. This patch introduces SO_TXTIME. User space enables this option in order to pass a desired future transmit time in a CMSG when calling sendmsg(2). The argument to this socket option is a 8-bytes long struct provided by the uapi header net_tstamp.h defined as: struct sock_txtime { clockid_t clockid; u32 flags; }; Note that new fields were added to struct sock by filling a 2-bytes hole found in the struct. For that reason, neither the struct size or number of cachelines were altered. Signed-off-by: Richard Cochran <rcochran@linutronix.de> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 22:30:27 +09:00
Jesus Sanchez-Palencia	c47d8c2f38	net: Clear skb->tstamp only on the forwarding path This is done in preparation for the upcoming time based transmission patchset. Now that skb->tstamp will be used to hold packet's txtime, we must ensure that it is being cleared when traversing namespaces. Also, doing that from skb_scrub_packet() before the early return would break our feature when tunnels are used. Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 22:30:27 +09:00
Gustavo A. R. Silva	d287c50243	isdn: mark expected switch fall-throughs In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Warning level 2 was used: -Wimplicit-fallthrough=2 Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 22:17:32 +09:00
Marcel Ziswiler	03fc5d4ffb	net: usb: asix: allow optionally getting mac address from device tree For Embedded use where e.g. AX88772B chips may be used without external EEPROMs the boot loader may choose to pass the MAC address to be used via device tree. Therefore, allow for optionally getting the MAC address from device tree data e.g. as follows (excerpt from a T30 based board, local-mac-address to be filled in by boot loader): /* EHCI instance 1: USB2_DP/N -> AX88772B */ usb@7d004000 { status = "okay"; #address-cells = <1>; #size-cells = <0>; asix@1 { reg = <1>; local-mac-address = [00 00 00 00 00 00]; }; }; Signed-off-by: Marcel Ziswiler <marcel.ziswiler@toradex.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 22:09:23 +09:00
Wei Yongjun	30e99ed6db	net: sched: act_pedit: fix possible memory leak in tcf_pedit_init() 'keys_ex' is malloced by tcf_pedit_keys_ex_parse() in tcf_pedit_init() but not all of the error handle path free it, this may cause memory leak. This patch fix it. Fixes: `71d0ed7079` ("net/act_pedit: Support using offset relative to the conventional network headers") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 22:08:21 +09:00
David S. Miller	7184e7e7d9	Merge branch 'bridge-iproute2-isolated-port-and-selftests' Nikolay Aleksandrov says: ==================== bridge: iproute2 isolated port and selftests Add support to iproute2 for port isolation config and selftests for it. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 21:40:02 +09:00
Nikolay Aleksandrov	a14e9fafaa	selftests: forwarding: test for bridge port isolation This test checks if the bridge port isolation feature works as expected by performing ping/ping6 tests between hosts that are isolated (should not work) and between an isolated and non-isolated hosts (should work). Same test is performed for flooding from and to isolated and non-isolated ports. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 21:40:02 +09:00
Nikolay Aleksandrov	967450c543	selftests: forwarding: lib: extract ping and ping6 so they can be reused Extract ping and ping6 command execution so the return value can be checked by the caller, this is needed for port isolation tests that are intended to fail. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 21:40:02 +09:00
David S. Miller	f744c4bb5c	Merge branch 'vhost_net-Avoid-vq-kicks-during-busyloop' Toshiaki Makita says: ==================== vhost_net: Avoid vq kicks during busyloop Under heavy load vhost tx busypoll tend not to suppress vq kicks, which causes poor guest tx performance. The detailed scenario is described in commitlog of patch 2. Rx seems not to have that serious problem, but for consistency I made a similar change on rx to avoid rx wakeups (patch 3). Additionary patch 4 is to avoid rx kicks under heavy load during busypoll. Tx performance is greatly improved by this change. I don't see notable performance change on rx with this series though. Performance numbers (tx): - Bulk transfer from guest to external physical server. [Guest]->vhost_net->tap--(XDP_REDIRECT)-->i40e --(wire)--> [Server] - Set 10us busypoll. - Guest disables checksum and TSO because of host XDP. - Measured single flow Mbps by netperf, and kicks by perf kvm stat (EPT_MISCONFIG event). Before After Mbps kicks/s Mbps kicks/s UDP_STREAM 1472byte 247758 27 Send 3645.37 6958.10 Recv 3588.56 6958.10 1byte 9865 37 Send 4.34 5.43 Recv 4.17 5.26 TCP_STREAM 8801.03 45794 9592.77 2884 v2: - Split patches into 3 parts (renaming variables, tx-kick fix, rx-wakeup fix). - Avoid rx-kicks too (patch 4). - Don't memorize endtime as it is not needed for now. ==================== Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 21:31:08 +09:00
Toshiaki Makita	6369fec5be	vhost_net: Avoid rx vring kicks during busyloop We may run out of avail rx ring descriptor under heavy load but busypoll did not detect it so busypoll may have exited prematurely. Avoid this by checking rx ring full during busypoll. Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 21:30:47 +09:00
Toshiaki Makita	be294a51ad	vhost_net: Avoid rx queue wake-ups during busypoll We may run handle_rx() while rx work is queued. For example a packet can push the rx work during the window before handle_rx calls vhost_net_disable_vq(). In that case busypoll immediately exits due to vhost_has_work() condition and enables vq again. This can lead to another unnecessary rx wake-ups, so poll rx work instead of enabling the vq. Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 21:30:46 +09:00
Toshiaki Makita	027b17603b	vhost_net: Avoid tx vring kicks during busyloop Under heavy load vhost busypoll may run without suppressing notification. For example tx zerocopy callback can push tx work while handle_tx() is running, then busyloop exits due to vhost_has_work() condition and enables notification but immediately reenters handle_tx() because the pushed work was tx. In this case handle_tx() tries to disable notification again, but when using event_idx it by design cannot. Then busyloop will run without suppressing notification. Another example is the case where handle_tx() tries to enable notification but avail idx is advanced so disables it again. This case also leads to the same situation with event_idx. The problem is that once we enter this situation busyloop does not work under heavy load for considerable amount of time, because notification is likely to happen during busyloop and handle_tx() immediately enables notification after notification happens. Specifically busyloop detects notification by vhost_has_work() and then handle_tx() calls vhost_enable_notify(). Because the detected work was the tx work, it enters handle_tx(), and enters busyloop without suppression again. This is likely to be repeated, so with event_idx we are almost not able to suppress notification in this case. To fix this, poll the work instead of enabling notification when busypoll is interrupted by something. IMHO vhost_has_work() is kind of interruption rather than a signal to completely cancel the busypoll, so let's run busypoll after the necessary work is done. Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 21:30:46 +09:00
Toshiaki Makita	28b9b33b98	vhost_net: Rename local variables in vhost_net_rx_peek_head_len So we can easily see which variable is for which, tx or rx. Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 21:30:46 +09:00
Qiaobin Fu	e7e3728bd7	net:sched: add action inheritdsfield to skbedit The new action inheritdsfield copies the field DS of IPv4 and IPv6 packets into skb->priority. This enables later classification of packets based on the DS field. v5: Update the drop counter for TC_ACT_SHOT v4: Not allow setting flags other than the expected ones. Allow dumping the pure flags. v3: Use optional flags, so that it won't break old versions of tc. Allow users to set both SKBEDIT_F_PRIORITY and SKBEDIT_F_INHERITDSFIELD flags. v2: Fix the style issue *Move the code from skbmod to skbedit Original idea by Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Qiaobin Fu <qiaobinf@bu.edu> Reviewed-by: Michel Machado <michel@digirati.com.br> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 21:27:42 +09:00
David S. Miller	f145b0a707	Merge branch 'More-mirror-to-gretap-tests-with-bridge-in-UL' Petr Machata says: ==================== More mirror-to-gretap tests with bridge in UL This patchset adds two more tests where the mirror-to-gretap has a bridge in underlay packet path, without a VLAN above or below that bridge. In patch #1, a non-VLAN-filtering bridge is tested. In patch #2, a VLAN-filtering bridge is tested. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 14:18:46 +09:00
Petr Machata	239e754af8	selftests: forwarding: Test mirror-to-gretap w/ UL 802.1q Test for "tc action mirred egress mirror" that mirrors to gretap when the underlay route points at a VLAN-aware bridge (802.1q). Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 14:18:45 +09:00
Petr Machata	35c31d5c32	selftests: forwarding: Test mirror-to-gretap w/ UL 802.1d Test for "tc action mirred egress mirror" that mirrors to gretap when the underlay route points at a VLAN-unaware bridge (802.1d). Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 14:18:45 +09:00
David S. Miller	2d1b138505	Merge branch 'Handle-multiple-received-packets-at-each-stage' Edward Cree says: ==================== Handle multiple received packets at each stage This patch series adds the capability for the network stack to receive a list of packets and process them as a unit, rather than handling each packet singly in sequence. This is done by factoring out the existing datapath code at each layer and wrapping it in list handling code. The motivation for this change is twofold: * Instruction cache locality. Currently, running the entire network stack receive path on a packet involves more code than will fit in the lowest-level icache, meaning that when the next packet is handled, the code has to be reloaded from more distant caches. By handling packets in "row-major order", we ensure that the code at each layer is hot for most of the list. (There is a corresponding downside in _data_ cache locality, since we are now touching every packet at every layer, but in practice there is easily enough room in dcache to hold one cacheline of each of the 64 packets in a NAPI poll.) * Reduction of indirect calls. Owing to Spectre mitigations, indirect function calls are now more expensive than ever; they are also heavily used in the network stack's architecture (see [1]). By replacing 64 indirect calls to the next-layer per-packet function with a single indirect call to the next-layer list function, we can save CPU cycles. Drivers pass an SKB list to the stack at the end of the NAPI poll; this gives a natural batch size (the NAPI poll weight) and avoids waiting at the software level for further packets to make a larger batch (which would add latency). It also means that the batch size is automatically tuned by the existing interrupt moderation mechanism. The stack then runs each layer of processing over all the packets in the list before proceeding to the next layer. Where the 'next layer' (or the context in which it must run) differs among the packets, the stack splits the list; this 'late demux' means that packets which differ only in later headers (e.g. same L2/L3 but different L4) can traverse the early part of the stack together. Also, where the next layer is not (yet) list-aware, the stack can revert to calling the rest of the stack in a loop; this allows gradual/creeping listification, with no 'flag day' patch needed to listify everything. Patches 1-2 simply place received packets on a list during the event processing loop on the sfc EF10 architecture, then call the normal stack for each packet singly at the end of the NAPI poll. (Analogues of patch #2 for other NIC drivers should be fairly straightforward.) Patches 3-9 extend the list processing as far as the IP receive handler. Patches 1-2 alone give about a 10% improvement in packet rate in the baseline test; adding patches 3-9 raises this to around 25%. Performance measurements were made with NetPerf UDP_STREAM, using 1-byte packets and a single core to handle interrupts on the RX side; this was in order to measure as simply as possible the packet rate handled by a single core. Figures are in Mbit/s; divide by 8 to obtain Mpps. The setup was tuned for maximum reproducibility, rather than raw performance. Full details and more results (both with and without retpolines) from a previous version of the patch series are presented in [2]. The baseline test uses four streams, and multiple RXQs all bound to a single CPU (the netperf binary is bound to a neighbouring CPU). These tests were run with retpolines. net-next: 6.91 Mb/s (datum) after 9: 8.46 Mb/s (+22.5%) Note however that these results are not robust; changes in the parameters of the test sometimes shrink the gain to single-digit percentages. For instance, when using only a single RXQ, only a 4% gain was seen. One test variation was the use of software filtering/firewall rules. Adding a single iptables rule (UDP port drop on a port range not matching the test traffic), thus making the netfilter hook have work to do, reduced baseline performance but showed a similar gain from the patches: net-next: 5.02 Mb/s (datum) after 9: 6.78 Mb/s (+35.1%) Similarly, testing with a set of TC flower filters (kindly supplied by Cong Wang) gave the following: net-next: 6.83 Mb/s (datum) after 9: 8.86 Mb/s (+29.7%) These data suggest that the batching approach remains effective in the presence of software switching rules, and perhaps even improves the performance of those rules by allowing them and their codepaths to stay in cache between packets. Changes from v3: * Fixed build error when CONFIG_NETFILTER=n (thanks kbuild). Changes from v2: * Used standard list handling (and skb->list) instead of the skb-queue functions (that use skb->next, skb->prev). - As part of this, changed from a "dequeue, process, enqueue" model to using list_for_each_safe, list_del, and (new) list_cut_before. * Altered __netif_receive_skb_core() changes in patch 6 as per Willem de Bruijn's suggestions (separate *ppt_prev from pt_prev; renaming). * Removed patches to Generic XDP, since they were producing no benefit. I may revisit them later. * Removed RFC tags. Changes from v1: * Rebased across 2 years' net-next movement (surprisingly straightforward). - Added Generic XDP handling to netif_receive_skb_list_internal() - Dealt with changes to PFMEMALLOC setting APIs * General cleanup of code and comments. * Skipped function calls for empty lists at various points in the stack (patch #9). * Added listified Generic XDP handling (patches 10-12), though it doesn't seem to help (see above). * Extended testing to cover software firewalls / netfilter etc. [1] http://vger.kernel.org/netconf2018_files/DavidMiller_netconf2018.pdf [2] http://vger.kernel.org/netconf2018_files/EdwardCree_netconf2018.pdf ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 14:06:20 +09:00
Edward Cree	b9f463d6c9	net: don't bother calling list RX functions on empty lists Generally the check should be very cheap, as the sk_buff_head is in cache. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 14:06:20 +09:00
Edward Cree	5fa12739a5	net: ipv4: listify ip_rcv_finish ip_rcv_finish_core(), if it does not drop, sets skb->dst by either early demux or route lookup. The last step, calling dst_input(skb), is left to the caller; in the listified case, we split to form sublists with a common dst, but then ip_sublist_rcv_finish() just calls dst_input(skb) in a loop. The next step in listification would thus be to add a list_input() method to struct dst_entry. Early demux is an indirect call based on iph->protocol; this is another opportunity for listification which is not taken here (it would require slicing up ip_rcv_finish_core() to allow splitting on protocol changes). Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 14:06:20 +09:00
Edward Cree	17266ee939	net: ipv4: listified version of ip_rcv Also involved adding a way to run a netfilter hook over a list of packets. Rather than attempting to make netfilter know about lists (which would be a major project in itself) we just let it call the regular okfn (in this case ip_rcv_finish()) for any packets it steals, and have it give us back a list of packets it's synchronously accepted (which normally NF_HOOK would automatically call okfn() on, but we want to be able to potentially pass the list to a listified version of okfn().) The netfilter hooks themselves are indirect calls that still happen per- packet (see nf_hook_entry_hookfn()), but again, changing that can be left for future work. There is potential for out-of-order receives if the netfilter hook ends up synchronously stealing packets, as they will be processed before any accepts earlier in the list. However, it was already possible for an asynchronous accept to cause out-of-order receives, so presumably this is considered OK. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 14:06:20 +09:00
Edward Cree	88eb1944e1	net: core: propagate SKB lists through packet_type lookup __netif_receive_skb_core() does a depressingly large amount of per-packet work that can't easily be listified, because the another_round looping makes it nontrivial to slice up into smaller functions. Fortunately, most of that work disappears in the fast path: * Hardware devices generally don't have an rx_handler * Unless you're tcpdumping or something, there is usually only one ptype * VLAN processing comes before the protocol ptype lookup, so doesn't force a pt_prev deliver so normally, __netif_receive_skb_core() will run straight through and pass back the one ptype found in ptype_base[hash of skb->protocol]. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 14:06:20 +09:00
Edward Cree	4ce0017a37	net: core: another layer of lists, around PF_MEMALLOC skb handling First example of a layer splitting the list (rather than merely taking individual packets off it). Involves new list.h function, list_cut_before(), like list_cut_position() but cuts on the other side of the given entry. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 14:06:19 +09:00
Edward Cree	7da517a3bc	net: core: Another step of skb receive list processing netif_receive_skb_list_internal() now processes a list and hands it on to the next function. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 14:06:19 +09:00
Edward Cree	920572b732	net: core: unwrap skb list receive slightly further Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 14:06:19 +09:00
Edward Cree	e090bfb9f1	sfc: batch up RX delivery Improves packet rate of 1-byte UDP receives by up to 10%. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 14:06:19 +09:00
Edward Cree	f6ad8c1bcd	net: core: trivial netif_receive_skb_list() entry point Just calls netif_receive_skb() in a loop. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 14:06:19 +09:00
David S. Miller	2bdea157b9	Merge branch 'sctp-fully-support-for-dscp-and-flowlabel-per-transport' Xin Long says: ==================== sctp: fully support for dscp and flowlabel per transport Now dscp and flowlabel are set from sock when sending the packets, but being multi-homing, sctp also supports for dscp and flowlabel per transport, which is described in section 8.1.12 in RFC6458. v1->v2: - define ip_queue_xmit as inline in net/ip.h, instead of exporting it in Patch 1/5 according to David's suggestion. - fix the param len check in sctp_s/getsockopt_peer_addr_params() in Patch 3/5 to guarantee that an old app built with old kernel headers could work on the newer kernel per Marcelo's point. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 11:36:55 +09:00
Xin Long	0999f021c9	sctp: check for ipv6_pinfo legal sndflow with flowlabel in sctp_v6_get_dst The transport with illegal flowlabel should not be allowed to send packets. Other transport protocols already denies this. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 11:36:54 +09:00
Xin Long	4be4139f7d	sctp: add support for setting flowlabel when adding a transport Struct sockaddr_in6 has the member sin6_flowinfo that includes the ipv6 flowlabel, it should also support for setting flowlabel when adding a transport whose ipaddr is from userspace. Note that addrinfo in sctp_sendmsg is using struct in6_addr for the secondary addrs, which doesn't contain sin6_flowinfo, and it needs to copy sin6_flowinfo from the primary addr. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 11:36:54 +09:00
Xin Long	0b0dce7a36	sctp: add spp_ipv6_flowlabel and spp_dscp for sctp_paddrparams spp_ipv6_flowlabel and spp_dscp are added in sctp_paddrparams in this patch so that users could set sctp_sock/asoc/transport dscp and flowlabel with spp_flags SPP_IPV6_FLOWLABEL or SPP_DSCP by SCTP_PEER_ADDR_PARAMS , as described section 8.1.12 in RFC6458. As said in last patch, it uses '\| 0x100000' or '\|0x1' to mark flowlabel or dscp is set, so that their values could be set to 0. Note that to guarantee that an old app built with old kernel headers could work on the newer kernel, the param's check in sctp_g/setsockopt_peer_addr_params() is also improved, which follows the way that sctp_g/setsockopt_delayed_ack() or some other sockopts' process that accept two types of params does. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 11:36:54 +09:00
Xin Long	8a9c58d28d	sctp: add support for dscp and flowlabel per transport Like some other per transport params, flowlabel and dscp are added in transport, asoc and sctp_sock. By default, transport sets its value from asoc's, and asoc does it from sctp_sock. flowlabel only works for ipv6 transport. Other than that they need to be passed down in sctp_xmit, flow4/6 also needs to set them before looking up route in get_dst. Note that it uses '& 0x100000' to check if flowlabel is set and '& 0x1' (tos 1st bit is unused) to check if dscp is set by users, so that they could be set to 0 by sockopt in next patch. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 11:36:54 +09:00
Xin Long	69b9e1e07d	ipv4: add __ip_queue_xmit() that supports tos param This patch introduces __ip_queue_xmit(), through which the callers can pass tos param into it without having to set inet->tos. For ipv6, ip6_xmit() already allows passing tclass parameter. It's needed when some transport protocol doesn't use inet->tos, like sctp's per transport dscp, which will be added in next patch. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 11:36:54 +09:00
Linus Walleij	05bd97fc55	net: dsa: Add Vitesse VSC73xx DSA router driver This adds a DSA driver for: Vitesse VSC7385 SparX-G5 5-port Integrated Gigabit Ethernet Switch Vitesse VSC7388 SparX-G8 8-port Integrated Gigabit Ethernet Switch Vitesse VSC7395 SparX-G5e 5+1-port Integrated Gigabit Ethernet Switch Vitesse VSC7398 SparX-G8e 8-port Integrated Gigabit Ethernet Switch These switches have a built-in 8051 CPU and can download and execute firmware in this CPU. They can also be configured to use an external CPU handling the switch in a memory-mapped manner by connecting to that external CPU's memory bus. This driver (currently) only takes control of the switch chip over SPI and configures it to route packages around when connected to a CPU port. The chip has embedded PHYs and VLAN support so we model it using DSA as a best fit so we can easily add VLAN support and maybe later also exploit the internal frame header to get more direct control over the switch. The four built-in GPIO lines are exposed using a standard GPIO chip. Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 11:30:02 +09:00
Linus Walleij	975ae7c69d	net: phy: vitesse: Add support for VSC73xx The VSC7385, VSC7388, VSC7395 and VSC7398 are integrated switch/router chips for 5+1 or 8-port switches/routers. When managed directly by Linux using DSA we need to do a special set-up "dance" on the PHY. Unfortunately these sequences switches the PHY to undocumented pages named 2a30 and 52b6 and does undocumented things. It is described by these opaque sequences also in the reference manual. This is a best effort to integrate it anyways. Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 11:30:02 +09:00
Linus Walleij	1decd2ec22	net: dsa: Add DT bindings for Vitesse VSC73xx switches This adds the device tree bindings for the Vitesse VSC73xx switches. We also add the vendor name for Vitesse. Cc: devicetree@vger.kernel.org Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-04 11:30:01 +09:00

1 2 3 4 5 ...

767646 Commits All Branches Search

767646 Commits

All Branches