select/poll busy-poll support.
Split sysctl value into two separate ones, one for read and one for poll.
updated Documentation/sysctl/net.txt
Add a new poll flag POLL_LL. When this flag is set, sock_poll will call
sk_poll_ll if possible. sock_poll sets this flag in its return value
to indicate to select/poll when a socket that can busy poll is found.
When poll/select have nothing to report, call the low-level
sock_poll again until we are out of time or we find something.
Once the system call finds something, it stops setting POLL_LL, so it can
return the result to the user ASAP.
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Callers of skb_seq_read() are currently forced to call skb_abort_seq_read()
even when consuming all the data because the last call to skb_seq_read (the
one that returns 0 to indicate the end) fails to unmap the last fragment page.
With this patch callers will be allowed to traverse the SKB data by calling
skb_prepare_seq_read() once and repeatedly calling skb_seq_read() as originally
intended (and documented in the original commit 677e90eda), that is, only call
skb_abort_seq_read() if the sequential read is actually aborted.
Signed-off-by: Wedson Almeida Filho <wedsonaf@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netif_alloc_netdev_queues() uses kcalloc() to allocate memory
for the "struct netdev_queue *_tx" array.
For large number of tx queues, kcalloc() might fail, so this
patch does a fallback to vzalloc().
As vmalloc() adds overhead on a critical network path, add __GFP_REPEAT
to kzalloc() flags to do this fallback only when really needed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
thresh and interval are global resources,
only init net can change them.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Though we don't export the /proc/sys/net/ipv[4,6]/neigh/default/
directory to the un-init_net, but we can still use cmd such as
"ip ntable change name arp_cache locktime 129" to change the locktime
of default neigh_parms.
This patch disallows the un-init_net to find out the neigh_table.parms.
So the un-init_net will failed to influence the init_net.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
neigh_table.parms always exist and is initialized,kmemdup
can use it to create new neigh_parms, actually lookup_neigh_parms
here will return neigh_table.parms too.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/wireless/ath/ath9k/Kconfig
drivers/net/xen-netback/netback.c
net/batman-adv/bat_iv_ogm.c
net/wireless/nl80211.c
The ath9k Kconfig conflict was a change of a Kconfig option name right
next to the deletion of another option.
The xen-netback conflict was overlapping changes involving the
handling of the notify list in xen_netbk_rx_action().
Batman conflict resolution provided by Antonio Quartulli, basically
keep everything in both conflict hunks.
The nl80211 conflict is a little more involved. In 'net' we added a
dynamic memory allocation to nl80211_dump_wiphy() to fix a race that
Linus reported. Meanwhile in 'net-next' the handlers were converted
to use pre and post doit handlers which use a flag to determine
whether to hold the RTNL mutex around the operation.
However, the dump handlers to not use this logic. Instead they have
to explicitly do the locking. There were apparent bugs in the
conversion of nl80211_dump_wiphy() in that we were not dropping the
RTNL mutex in all the return paths, and it seems we very much should
be doing so. So I fixed that whilst handling the overlapping changes.
To simplify the initial returns, I take the RTNL mutex after we try
to allocate 'tb'.
Signed-off-by: David S. Miller <davem@davemloft.net>
As part of the push to add 802.1ad server provider tagging support to the
kernel the VLAN features flags were renamed. Unfortunately the kernel name
for the VLAN hardware acceleration features that the kernel shows user space
was included in the rename, which broke ethtool (txvlan and rxvlan options
do not work). This patch restores the original names, i.e. the original ABI.
If we wanted to make clear to users that we are refering to CTAGs we can
always change ethtool's short_name and long_name for these features (for
example something along the lines of txvlan -> txvlan-ctag, tx-vlan-offload ->
tx-vlan-ctag-offload).
Cc: Patrick McHardy <kaber@trash.net>
Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
adds a socket option for low latency polling.
This allows overriding the global sysctl value with a per-socket one.
Unexport sysctl_net_ll_poll since for now it's not needed in modules.
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no reason for sysctl_net_ll_poll to be an unsigned long.
Change it into an unsigned int.
Fix the proc handler.
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add netlink directives and ndo entry to allow for controling
VF link, which can be in one of three states:
Auto - VF link state reflects the PF link state (default)
Up - VF link state is up, traffic from VF to VF works even if
the actual PF link is down
Down - VF link state is down, no traffic from/to this VF, can be of
use while configuring the VF
Signed-off-by: Rony Efraim <ronye@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Caught by sparse:
- __rcu: missing annotation to sd->flow_limit
- __user: direct access in cpumask_scnprintf
Also
- add endline character when printing bitmap if room in buffer
- avoid bucket overflow by reducing FLOW_LIMIT_HISTORY
The last item warrants some explanation. The hashtable buckets are
subject to overflow if FLOW_LIMIT_HISTORY is larger than or equal
to bucket size, since all packets may end up in a single bucket. The
current (rather arbitrary) history value of 256 happens to match the
buffer size (u8).
As a result, with a single flow, the first 128 packets are accepted
(correct), the second 128 packets dropped (correct) and then the
history[] array has filled, so that each subsequent new packet
causes an increment in the bucket for new_flow plus a decrement
for old_flow: a steady state.
This is fine if packets are dropped, as the steady state goes away
as soon as a mix of traffic reappears. But, because the 256th packet
overflowed the bucket to 0: no packets are dropped.
Instead of explicitly adding an overflow check, this patch changes
FLOW_LIMIT_HISTORY to never be able to overflow a single bucket.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
(first item)
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reduce the uses of this unnecessary typedef.
Done via perl script:
$ git grep --name-only -w ctl_table net | \
xargs perl -p -i -e '\
sub trim { my ($local) = @_; $local =~ s/(^\s+|\s+$)//g; return $local; } \
s/\b(?<!struct\s)ctl_table\b(\s*\*\s*|\s+\w+)/"struct ctl_table " . trim($1)/ge'
Reflow the modified lines that now exceed 80 columns.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since team functionality relies heavily on userspace daemon, we need to
deliver event to userspace via Netlink as quick as possible. So make all
team port device link events urgent.
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
We currently allow for numa-node aware skb allocation only within the
fill_packet_ipv4() path, but not in fill_packet_ipv6(). Consolidate that
code to a common allocation helper to enable numa-node aware skb
allocation for ipv6, and use it in both paths. This also makes both
functions a bit more readable.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct gnet_stats_rate_est contains u32 fields, so the bytes per second
field can wrap at 34360Mbit.
Add a new gnet_stats_rate_est64 structure to get 64bit bps/pps fields,
and switch the kernel to use this structure natively.
This structure is dumped to user space as a new attribute :
TCA_STATS_RATE_EST64
Old tc command will now display the capped bps (to 34360Mbit), instead
of wrapped values, and updated tc command will display correct
information.
Old tc command output, after patch :
eric:~# tc -s -d qd sh dev lo
qdisc pfifo 8001: root refcnt 2 limit 1000p
Sent 80868245400 bytes 1978837 pkt (dropped 0, overlimits 0 requeues 0)
rate 34360Mbit 189696pps backlog 0b 0p requeues 0
This patch carefully reorganizes "struct Qdisc" layout to get optimal
performance on SMP.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 1a37e412a022(net: Use 16bits for *_headers fields of struct
skbuff), skb->*_header are relative to skb->head,
so copy_skb_header() should not call skb_headers_offset_update() now,
and we should pass correct parameter to skb_headers_offset_update() in
pskb_expand_head() and skb_copy_expand().
Signed-off-by: Weiping Pan <panweiping3@gmail.com>
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Filters need to be translated to real BPF code for userland, like SO_GETFILTER.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add upport for busy-polling on UDP sockets.
In __udp[46]_lib_rcv add a call to sk_mark_ll() to copy the napi_id
from the skb into the sk.
This is done at the earliest possible moment, right after we identify
which socket this skb is for.
In __skb_recv_datagram When there is no data and the user
tries to read we busy poll.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds an ndo_ll_poll method and the code that supports it.
This method can be used by low latency applications to busy-poll
Ethernet device queues directly from the socket code.
sysctl_net_ll_poll controls how many microseconds to poll.
Default is zero (disabled).
Individual protocol support will be added by subsequent patches.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds a napi_id and a hashing mechanism to lookup a napi by id.
This will be used by subsequent patches to implement low latency
Ethernet device polling.
Based on a code sample by Eric Dumazet.
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge 'net' bug fixes into 'net-next' as we have patches
that will build on top of them.
This merge commit includes a change from Emil Goode
(emilgoode@gmail.com) that fixes a warning that would
have been introduced by this merge. Specifically it
fixes the pingv6_ops method ipv6_chk_addr() to add a
"const" to the "struct net_device *dev" argument and
likewise update the dummy_ipv6_chk_addr() declaration.
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to the problem in pktgen, netpoll uses skb_tail_offset()
too, as the code is copied from pktgen.
Also use return values of skb_put() directly, this will simiplify
the code.
Reported-by: Thomas Graf <tgraf@suug.ch>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Daniel Borkmann <dborkmann@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_set_network_header() expects an offset based on the data pointer
whereas skb_tail_offset() also includes the headroom. This resulted
in the ip header being written in a wrong location.
Use return values of skb_put() directly and rely on skb->len to
set mac, network, and transport header.
Cc: Simon Horman <horms@verge.net.au>
Cc: Daniel Borkmann <dborkmann@redhat.com>
Assisted-by: Daniel Borkmann <dborkmann@redhat.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Daniel Borkmann <dborkmann@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet spotted that we have to check skb->head instead
of skb->data as skb->head points to the beginning of the
data area of the skbuff. Similarly, we have to initialize the
skb->head pointer, not skb->data in __alloc_skb_head.
After this fix, netlink crashes in the release path of the
sk_buff, so let's fix that as well.
This bug was introduced in (0ebd0ac net: add function to
allocate sk_buff head without data area).
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
netdev_create_hash() is only called from netdev_init() which is marked
__net_init.
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 1a37e412a0 (net: Use 16bits for *_headers
fields of struct skbuff) converts skb->*_header to u16,
some #if NET_SKBUFF_DATA_USES_OFFSET are now useless,
and to be safe, we could just use "X = (typeof(X)) ~0U;"
as suggested by David.
Cc: David S. Miller <davem@davemloft.net>
Cc: Simon Horman <horms@verge.net.au>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The dev_mc_sync_multiple function is currently calling
__hw_addr_sync, and not __hw_addr_sync_multiple. This will result in
addresses only being synced to the first device from the set.
Corrected by calling the _multiple variant.
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Reviewed-by: Vlad Yasevich <vyasevic@redhat.com>
Tested-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, __hw_addr_sync_one is called in a loop by
__hw_addr_sync_multiple to sync each of a "from" device's hw addresses
to a "to" device. __hw_addr_sync_one calls __hw_addr_add_ex to attempt
to add each address. __hw_addr_add_ex is called with global=false, and
sync=true.
__hw_addr_add_ex checks to see if the new address matches an
address already on the list. If so, it tests global and sync. In this
case, sync=true, and it then checks if the address is already synced,
and if so, returns 0.
This 0 return causes __hw_addr_sync_one to increment the sync_cnt
and refcount for the "from" list's address entry, even though the address
is already synced and has a reference and sync_cnt. This will cause
the sync_cnt and refcount to increment without bound every time an
addresses is added to the "from" device and synced to the "to" device.
The fix here has two parts:
First, when __hw_addr_add_ex finds the address already exists
and is synced, return -EEXIST instead of 0.
Second, __hw_addr_sync_one checks the error return for -EEXIST,
and if so, it (a) does not add a refcount/sync_cnt, and (b) returns 0
itself so that __hw_addr_sync_multiple will not return an error.
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Reviewed-by: Vlad Yasevich <vyasevic@redhat.com>
Tested-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an address is added to a subordinate interface (the "to"
list), the address entry in the "from" list is not marked "synced" as
the entry added to the "to" list is.
When performing the unsync operation (e.g., dev_mc_unsync),
__hw_addr_unsync_one calls __hw_addr_del_entry with the "synced"
parameter set to true for the case when the address reference is being
released from the "from" list. This causes a test inside to fail,
with the result being that the reference count on the "from" address
is not properly decremeted and the address on the "from" list will
never be freed.
Correct this by having __hw_addr_unsync_one call the
__hw_addr_del_entry function with the "sync" flag set to false for the
"remove from the from list" case.
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Reviewed-by: Vlad Yasevich <vyasevic@redhat.com>
Tested-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sync_cnt field is not being initialized, which can result
in arbitrary values in the field. Fixed by initializing it to zero.
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Reviewed-by: Vlad Yasevich <vyasevic@redhat.com>
Tested-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The three arrays of strings: af_family_key_strings,
af_family_slock_key_strings and af_family_clock_key_strings have not
VSOCK's string
Signed-off-by: Federico Vaga <federico.vaga@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This corrects an regression introduced by "net: Use 16bits for *_headers
fields of struct skbuff" when NET_SKBUFF_DATA_USES_OFFSET is not set. In
that case skb->tail will be a pointer however skb->network_header is now
an offset.
This patch corrects the problem by adding a wrapper to return skb tail as
an offset regardless of the value of NET_SKBUFF_DATA_USES_OFFSET. It seems
that skb->tail that this offset may be more than 64k and some care has been
taken to treat such cases as an error.
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This corrects an regression introduced by "net: Use 16bits for *_headers
fields of struct skbuff" when NET_SKBUFF_DATA_USES_OFFSET is not set. In
that case skb->tail will be a pointer whereas skb->transport_header
will be an offset from head. This is corrected by using wrappers that
ensure that comparisons and calculations are always made using pointers.
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 351638e7de (net: pass info struct via netdevice notifier)
breaks booting of my KVM guest, this is due to we still forget to pass
struct netdev_notifier_info in several places. This patch completes it.
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/core/skbuff.c: In function ‘__alloc_skb_head’:
net/core/skbuff.c:203:2: warning: large integer implicitly truncated to unsigned type [-Woverflow]
net/core/skbuff.c: In function ‘__alloc_skb’:
net/core/skbuff.c:279:2: warning: large integer implicitly truncated to unsigned type [-Woverflow]
net/core/skbuff.c:280:2: warning: large integer implicitly truncated to unsigned type [-Woverflow]
net/core/skbuff.c: In function ‘build_skb’:
net/core/skbuff.c:348:2: warning: large integer implicitly truncated to unsigned type [-Woverflow]
net/core/skbuff.c:349:2: warning: large integer implicitly truncated to unsigned type [-Woverflow]
Signed-off-by: David S. Miller <davem@davemloft.net>
Use new netdevice notifier infrastructure to pass along changed flags.
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
v2->v3: shortened notifier_info struct name
Signed-off-by: David S. Miller <davem@davemloft.net>
So far, only net_device * could be passed along with netdevice notifier
event. This patch provides a possibility to pass custom structure
able to provide info that event listener needs to know.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
v2->v3: fix typo on simeth
shortened dev_getter
shortened notifier_info struct name
v1->v2: fix notifier_call parameter in call_netdevice_notifier()
Signed-off-by: David S. Miller <davem@davemloft.net>
The netpoll_rx_disable() will always return 0, it is no use and looks wordy,
so remove the unnecessary code and get rid of it in _dev_open and _dev_close.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the case where a non-MPLS packet is received and an MPLS stack is
added it may well be the case that the original skb is GSO but the
NIC used for transmit does not support GSO of MPLS packets.
The aim of this code is to provide GSO in software for MPLS packets
whose skbs are GSO.
SKB Usage:
When an implementation adds an MPLS stack to a non-MPLS packet it should do
the following to skb metadata:
* Set skb->inner_protocol to the old non-MPLS ethertype of the packet.
skb->inner_protocol is added by this patch.
* Set skb->protocol to the new MPLS ethertype of the packet.
* Set skb->network_header to correspond to the
end of the L3 header, including the MPLS label stack.
I have posted a patch, "[PATCH v3.29] datapath: Add basic MPLS support to
kernel" which adds MPLS support to the kernel datapath of Open vSwtich.
That patch sets the above requirements in datapath/actions.c:push_mpls()
and was used to exercise this code. The datapath patch is against the Open
vSwtich tree but it is intended that it be added to the Open vSwtich code
present in the mainline Linux kernel at some point.
Features:
I believe that the approach that I have taken is at least partially
consistent with the handling of other protocols. Jesse, I understand that
you have some ideas here. I am more than happy to change my implementation.
This patch adds dev->mpls_features which may be used by devices
to advertise features supported for MPLS packets.
A new NETIF_F_MPLS_GSO feature is added for devices which support
hardware MPLS GSO offload. Currently no devices support this
and MPLS GSO always falls back to software.
Alternate Implementation:
One possible alternate implementation is to teach netif_skb_features()
and skb_network_protocol() about MPLS, in a similar way to their
understanding of VLANs. I believe this would avoid the need
for net/mpls/mpls_gso.c and in particular the calls to
__skb_push() and __skb_push() in mpls_gso_segment().
I have decided on the implementation in this patch as it should
not introduce any overhead in the case where mpls_gso is not compiled
into the kernel or inserted as a module.
MPLS GSO suggested by Jesse Gross.
Based in part on "v4 GRE: Add TCP segmentation offload for GRE"
by Pravin B Shelar.
Cc: Jesse Gross <jesse@nicira.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now when upper device is changed, event is not propagated via RT Netlink
to userspace. Userspace might never now about the change. Fix this by
adding upper-device-change notifier event.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge net into net-next because some upcoming net-next changes
build on top of bug fixes that went into net.
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a generic solution to resolve a specific problem that I have observed.
If the encapsulation of an skb changes then ability to offload checksums
may also change. In particular it may be necessary to perform checksumming
in software.
An example of such a case is where a non-GRE packet is received but
is to be encapsulated and transmitted as GRE.
Another example relates to my proposed support for for packets
that are non-MPLS when received but MPLS when transmitted.
The cost of this change is that the value of the csum variable may be
checked when it previously was not. In the case where the csum variable is
true this is pure overhead. In the case where the csum variable is false it
leads to software checksumming, which I believe also leads to correct
checksums in transmitted packets for the cases described above.
Further analysis:
This patch relies on the return value of can_checksum_protocol()
being correct and in turn the return value of skb_network_protocol(),
used to provide the protocol parameter of can_checksum_protocol(),
being correct. It also relies on the features passed to skb_segment()
and in turn to can_checksum_protocol() being correct.
I believe that this problem has not been observed for VLANs because it
appears that almost all drivers, the exception being xgbe, set
vlan_features such that that the checksum offload support for VLAN packets
is greater than or equal to that of non-VLAN packets.
I wonder if the code in xgbe may be an oversight and the hardware does
support checksumming of VLAN packets. If so it may be worth updating the
vlan_features of the driver as this patch will force such checksums to be
performed in software rather than hardware.
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
A cpu executing the network receive path sheds packets when its input
queue grows to netdev_max_backlog. A single high rate flow (such as a
spoofed source DoS) can exceed a single cpu processing rate and will
degrade throughput of other flows hashed onto the same cpu.
This patch adds a more fine grained hashtable. If the netdev backlog
is above a threshold, IRQ cpus track the ratio of total traffic of
each flow (using 4096 buckets, configurable). The ratio is measured
by counting the number of packets per flow over the last 256 packets
from the source cpu. Any flow that occupies a large fraction of this
(set at 50%) will see packet drop while above the threshold.
Tested:
Setup is a muli-threaded UDP echo server with network rx IRQ on cpu0,
kernel receive (RPS) on cpu0 and application threads on cpus 2--7
each handling 20k req/s. Throughput halves when hit with a 400 kpps
antagonist storm. With this patch applied, antagonist overload is
dropped and the server processes its complete load.
The patch is effective when kernel receive processing is the
bottleneck. The above RPS scenario is a extreme, but the same is
reached with RFS and sufficient kernel processing (iptables, packet
socket tap, ..).
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ERROR: "memcpy_fromiovec" [drivers/vhost/vhost_scsi.ko] undefined!
That function is only present with CONFIG_NET. Turns out that
crypto/algif_skcipher.c also uses that outside net, but it actually
needs sockets anyway.
In addition, commit 6d4f0139d6 added
CONFIG_NET dependency to CONFIG_VMCI for memcpy_toiovec, so hoist
that function and revert that commit too.
socket.h already includes uio.h, so no callers need updating; trying
only broke things fo x86_64 randconfig (thanks Fengguang!).
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This was added by commit 59b9997bab (Revert "net: maintain namespace
isolation between vlan and real device").
In fact, before the initial commit - the one that is reverted -, this
statement was not present.
'skb->dev = dev' is already done in eth_type_trans(), which is call just
after.
Spotted-by: Alain Ritoux <alain.ritoux@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have seen multiple NULL dereferences in __inet6_lookup_established()
After analysis, I found that inet6_sk() could be NULL while the
check for sk_family == AF_INET6 was true.
Bug was added in linux-2.6.29 when RCU lookups were introduced in UDP
and TCP stacks.
Once an IPv6 socket, using SLAB_DESTROY_BY_RCU is inserted in a hash
table, we no longer can clear pinet6 field.
This patch extends logic used in commit fcbdf09d96
("net: fix nulls list corruptions in sk_prot_alloc")
TCP/UDP/UDPLite IPv6 protocols provide their own .clear_sk() method
to make sure we do not clear pinet6 field.
At socket clone phase, we do not really care, as cloning the parent (non
NULL) pinet6 is not adding a fatal race.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rather than having logic to calculate inner protocol in every
tunnel gso handler move it to gso code. This simplifies code.
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Cong Wang <amwang@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The return value is reversed from mutex_trylock().
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>