Some network drivers use dev_kfree_skb_any() and dev_kfree_skb_irq()
helpers to free skbs, both for dropped packets and TX completed ones.
We need to separate the two causes to get better diagnostics
given by dropwatch or "perf record -e skb:kfree_skb"
This patch provides two new helpers, dev_consume_skb_any() and
dev_consume_skb_irq() to be used for consumed skbs.
__dev_kfree_skb_irq() is slightly optimized to remove one
atomic_dec_and_test() in fast path, and use this_cpu_{r|w} accessors.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull trivial tree updates from Jiri Kosina:
"Usual earth-shaking, news-breaking, rocket science pile from
trivial.git"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
doc: usb: Fix typo in Documentation/usb/gadget_configs.txt
doc: add missing files to timers/00-INDEX
timekeeping: Fix some trivial typos in comments
mm: Fix some trivial typos in comments
irq: Fix some trivial typos in comments
NUMA: fix typos in Kconfig help text
mm: update 00-INDEX
doc: Documentation/DMA-attributes.txt fix typo
DRM: comment: `halve' -> `half'
Docs: Kconfig: `devlopers' -> `developers'
doc: typo on word accounting in kprobes.c in mutliple architectures
treewide: fix "usefull" typo
treewide: fix "distingush" typo
mm/Kconfig: Grammar s/an/a/
kexec: Typo s/the/then/
Documentation/kvm: Update cpuid documentation for steal time and pv eoi
treewide: Fix common typo in "identify"
__page_to_pfn: Fix typo in comment
Correct some typos for word frequency
clk: fixed-factor: Fix a trivial typo
...
Pull networking updates from David Miller:
1) The addition of nftables. No longer will we need protocol aware
firewall filtering modules, it can all live in userspace.
At the core of nftables is a, for lack of a better term, virtual
machine that executes byte codes to inspect packet or metadata
(arriving interface index, etc.) and make verdict decisions.
Besides support for loading packet contents and comparing them, the
interpreter supports lookups in various datastructures as
fundamental operations. For example sets are supports, and
therefore one could create a set of whitelist IP address entries
which have ACCEPT verdicts attached to them, and use the appropriate
byte codes to do such lookups.
Since the interpreted code is composed in userspace, userspace can
do things like optimize things before giving it to the kernel.
Another major improvement is the capability of atomically updating
portions of the ruleset. In the existing netfilter implementation,
one has to update the entire rule set in order to make a change and
this is very expensive.
Userspace tools exist to create nftables rules using existing
netfilter rule sets, but both kernel implementations will need to
co-exist for quite some time as we transition from the old to the
new stuff.
Kudos to Patrick McHardy, Pablo Neira Ayuso, and others who have
worked so hard on this.
2) Daniel Borkmann and Hannes Frederic Sowa made several improvements
to our pseudo-random number generator, mostly used for things like
UDP port randomization and netfitler, amongst other things.
In particular the taus88 generater is updated to taus113, and test
cases are added.
3) Support 64-bit rates in HTB and TBF schedulers, from Eric Dumazet
and Yang Yingliang.
4) Add support for new 577xx tigon3 chips to tg3 driver, from Nithin
Sujir.
5) Fix two fatal flaws in TCP dynamic right sizing, from Eric Dumazet,
Neal Cardwell, and Yuchung Cheng.
6) Allow IP_TOS and IP_TTL to be specified in sendmsg() ancillary
control message data, much like other socket option attributes.
From Francesco Fusco.
7) Allow applications to specify a cap on the rate computed
automatically by the kernel for pacing flows, via a new
SO_MAX_PACING_RATE socket option. From Eric Dumazet.
8) Make the initial autotuned send buffer sizing in TCP more closely
reflect actual needs, from Eric Dumazet.
9) Currently early socket demux only happens for TCP sockets, but we
can do it for connected UDP sockets too. Implementation from Shawn
Bohrer.
10) Refactor inet socket demux with the goal of improving hash demux
performance for listening sockets. With the main goals being able
to use RCU lookups on even request sockets, and eliminating the
listening lock contention. From Eric Dumazet.
11) The bonding layer has many demuxes in it's fast path, and an RCU
conversion was started back in 3.11, several changes here extend the
RCU usage to even more locations. From Ding Tianhong and Wang
Yufen, based upon suggestions by Nikolay Aleksandrov and Veaceslav
Falico.
12) Allow stackability of segmentation offloads to, in particular, allow
segmentation offloading over tunnels. From Eric Dumazet.
13) Significantly improve the handling of secret keys we input into the
various hash functions in the inet hashtables, TCP fast open, as
well as syncookies. From Hannes Frederic Sowa. The key fundamental
operation is "net_get_random_once()" which uses static keys.
Hannes even extended this to ipv4/ipv6 fragmentation handling and
our generic flow dissector.
14) The generic driver layer takes care now to set the driver data to
NULL on device removal, so it's no longer necessary for drivers to
explicitly set it to NULL any more. Many drivers have been cleaned
up in this way, from Jingoo Han.
15) Add a BPF based packet scheduler classifier, from Daniel Borkmann.
16) Improve CRC32 interfaces and generic SKB checksum iterators so that
SCTP's checksumming can more cleanly be handled. Also from Daniel
Borkmann.
17) Add a new PMTU discovery mode, IP_PMTUDISC_INTERFACE, which forces
using the interface MTU value. This helps avoid PMTU attacks,
particularly on DNS servers. From Hannes Frederic Sowa.
18) Use generic XPS for transmit queue steering rather than internal
(re-)implementation in virtio-net. From Jason Wang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
random32: add test cases for taus113 implementation
random32: upgrade taus88 generator to taus113 from errata paper
random32: move rnd_state to linux/random.h
random32: add prandom_reseed_late() and call when nonblocking pool becomes initialized
random32: add periodic reseeding
random32: fix off-by-one in seeding requirement
PHY: Add RTL8201CP phy_driver to realtek
xtsonic: add missing platform_set_drvdata() in xtsonic_probe()
macmace: add missing platform_set_drvdata() in mace_probe()
ethernet/arc/arc_emac: add missing platform_set_drvdata() in arc_emac_probe()
ipv6: protect for_each_sk_fl_rcu in mem_check with rcu_read_lock_bh
vlan: Implement vlan_dev_get_egress_qos_mask as an inline.
ixgbe: add warning when max_vfs is out of range.
igb: Update link modes display in ethtool
netfilter: push reasm skb through instead of original frag skbs
ip6_output: fragment outgoing reassembled skb properly
MAINTAINERS: mv643xx_eth: take over maintainership from Lennart
net_sched: tbf: support of 64bit rates
ixgbe: deleting dfwd stations out of order can cause null ptr deref
ixgbe: fix build err, num_rx_queues is only available with CONFIG_RPS
...
Add a operations structure that allows a network interface to export
the fact that it supports package forwarding in hardware between
physical interfaces and other mac layer devices assigned to it (such
as macvlans). This operaions structure can be used by virtual mac
devices to bypass software switching so that forwarding can be done
in hardware more efficiently.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a bug in cpsw_probe() where we do:
ndev->irq = platform_get_irq(pdev, 0);
if (ndev->irq < 0) {
The problem is that "ndev->irq" is unsigned so the error handling
doesn't work. I have changed it to a regular int.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here's the big driver core / sysfs update for 3.13-rc1.
There's lots of dev_groups updates for different subsystems, as they all
get slowly migrated over to the safe versions of the attribute groups
(removing userspace races with the creation of the sysfs files.) Also
in here are some kobject updates, devres expansions, and the first round
of Tejun's sysfs reworking to enable it to be used by other subsystems
as a backend for an in-kernel filesystem.
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iEYEABECAAYFAlJ6xAMACgkQMUfUDdst+yk1kQCfcHXhfnrvFZ5J/mDP509IzhNS
ddEAoLEWoivtBppNsgrWqXpD1vi4UMsE
=JmVW
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core / sysfs patches from Greg KH:
"Here's the big driver core / sysfs update for 3.13-rc1.
There's lots of dev_groups updates for different subsystems, as they
all get slowly migrated over to the safe versions of the attribute
groups (removing userspace races with the creation of the sysfs
files.) Also in here are some kobject updates, devres expansions, and
the first round of Tejun's sysfs reworking to enable it to be used by
other subsystems as a backend for an in-kernel filesystem.
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (83 commits)
sysfs: rename sysfs_assoc_lock and explain what it's about
sysfs: use generic_file_llseek() for sysfs_file_operations
sysfs: return correct error code on unimplemented mmap()
mdio_bus: convert bus code to use dev_groups
device: Make dev_WARN/dev_WARN_ONCE print device as well as driver name
sysfs: separate out dup filename warning into a separate function
sysfs: move sysfs_hash_and_remove() to fs/sysfs/dir.c
sysfs: remove unused sysfs_get_dentry() prototype
sysfs: honor bin_attr.attr.ignore_lockdep
sysfs: merge sysfs_elem_bin_attr into sysfs_elem_attr
devres: restore zeroing behavior of devres_alloc()
sysfs: fix sysfs_write_file for bin file
input: gameport: convert bus code to use dev_groups
input: serio: remove bus usage of dev_attrs
input: serio: use DEVICE_ATTR_RO()
i2o: convert bus code to use dev_groups
memstick: convert bus code to use dev_groups
tifm: convert bus code to use dev_groups
virtio: convert bus code to use dev_groups
ipack: convert bus code to use dev_groups
...
Joby Poriyath provided a xen-netback patch to reduce the size of
xenvif structure as some netdev allocation could fail under
memory pressure/fragmentation.
This patch is handling the problem at the core level, allowing
any netdev structures to use vmalloc() if kmalloc() failed.
As vmalloc() adds overhead on a critical network path, add __GFP_REPEAT
to kzalloc() flags to do this fallback only when really needed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Joby Poriyath <joby.poriyath@citrix.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
napi_disable uses an msleep() call to wait for outstanding napi work to be
finished after setting the disable bit. It does not always sleep incase there
was no outstanding work. This resulted in a rare bug in ixgbe_down operation
where a napi_disable call took place inside of a local_bh_disable()d context.
In order to enable easier detection of future sleep while atomic BUGs, this
patch adds a might_sleep() call, so that every use of napi_disable during
atomic context will be visible.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: Alexander Duyck <alexander.duyck@intel.com>
Cc: Hyong-Youb Kim <hykim@myri.com>
Cc: Amir Vadai <amirv@mellanox.com>
Cc: Dmitry Kravkov <dmitry@broadcom.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Macro definitions should not normally end with a semi-colon, as this
makes it dangerous to use them an if...else statement. Happily this
has not happened yet.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Correct common misspelling of "identify" as "indentify" throughout
the kernel
Signed-off-by: Maxime Jayat <maxime@artisandeveloppeur.fr>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Conflicts:
include/linux/netdevice.h
net/core/sock.c
Trivial merge issues.
Removal of "extern" for functions declaration in netdevice.h
at the same time "const" was added to an argument.
Two parallel line additions in net/core/sock.c
Signed-off-by: David S. Miller <davem@davemloft.net>
Separate the unreg_list and the close_list in dev_close_many preventing
dev_close_many from permuting the unreg_list. The permutations of the
unreg_list have resulted in cases where the loopback device is accessed
it has been freed in code such as dst_ifdown. Resulting in subtle memory
corruption.
This is the second bug from sharing the storage between the close_list
and the unreg_list. The issues that crop up with sharing are
apparently too subtle to show up in normal testing or usage, so let's
forget about being clever and use two separate lists.
v2: Make all callers pass in a close_list to dev_close_many
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
virtio wants to pass in cpumask_of(cpu), make parameter
const to avoid build warnings.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
include/linux/netdevice.h
More extern removals from Joe Perches.
Minor conflict with the dev_notify_flags changes which added a new
argument to __dev_notify_flags().
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch only prepares the next one, there is no functional change.
Now, __dev_notify_flags() can also be used to notify flags changes via
rtnetlink.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are a mix of function prototypes with and without extern
in the kernel sources. Standardize on not using extern for
function prototypes.
Function prototypes don't need to be written with extern.
extern is assumed by the compiler. Its use is as unnecessary as
using auto to declare automatic/local variables in a block.
Signed-off-by: Joe Perches <joe@perches.com>
sysfs ns (namespace) implementation became more convoluted than
necessary while trying to hide ns information from visible interface.
The relatively recent attr ns support is a good example.
* attr ns tag is determined by sysfs_ops->namespace() callback while
dir tag is determined by kobj_type->namespace(). The placement is
arbitrary.
* Instead of performing operations with explicit ns tag, the namespace
callback is routed through sysfs_attr_ns(), sysfs_ops->namespace(),
class_attr_namespace(), class_attr->namespace(). It's not simpler
in any sense. The only thing this convolution does is traversing
the whole stack backwards.
The namespace callbacks are unncessary because the operations involved
are inherently synchronous. The information can be provided in in
straight-forward top-down direction and reversing that direction is
unnecessary and against basic design principles.
This backward interface is unnecessarily convoluted and hinders
properly separating out sysfs from driver model / kobject for proper
layering. This patch updates attr ns support such that
* sysfs_ops->namespace() and class_attr->namespace() are dropped.
* sysfs_{create|remove}_file_ns(), which take explicit @ns param, are
added and sysfs_{create|remove}_file() are now simple wrappers
around the ns aware functions.
* ns handling is dropped from sysfs_chmod_file(). Nobody uses it at
this point. sysfs_chmod_file_ns() can be added later if necessary.
* Explicit @ns is propagated through class_{create|remove}_file_ns()
and netdev_class_{create|remove}_file_ns().
* driver/net/bonding which is currently the only user of attr
namespace is updated to use netdev_class_{create|remove}_file_ns()
with @bh->net as the ns tag instead of using the namespace callback.
This patch should be an equivalent conversion without any functional
difference. It makes the code easier to follow, reduces lines of code
a bit and helps proper separation and layering.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It will be useful to get first/last element.
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a possibility to iterate through netdev_adjacent's private, currently
only for lower neighbours.
Add both RCU and RTNL/other locking variants of iterators, and make the
non-rcu variant to be safe from removal.
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, even though we can access any linked device, we can't attach
anything to it, which is vital to properly manage them.
To fix this, add a new void *private to netdev_adjacent and functions
setting/getting it (per link), so that we can save, per example, bonding's
slave structures there, per slave device.
netdev_master_upper_dev_link_private(dev, upper_dev, private) links dev to
upper dev and populates the neighbour link only with private.
netdev_lower_dev_get_private{,_rcu}() returns the private, if found.
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, we distinguish neighbours (first-level linked devices) from
non-neighbours by the neighbour bool in the netdev_adjacent. This could be
quite time-consuming in case we would like to traverse *only* through
neighbours - cause we'd have to traverse through all devices and check for
this flag, and in a (quite common) scenario where we have lots of vlans on
top of bridge, which is on top of a bond - the bonding would have to go
through all those vlans to get its upper neighbour linked devices.
This situation is really unpleasant, cause there are already a lot of cases
when a device with slaves needs to go through them in hot path.
To fix this, introduce a new upper/lower device lists structure -
adj_list, which contains only the neighbours. It works always in
pair with the all_adj_list structure (renamed from upper/lower_dev_list),
i.e. both of them contain the same links, only that all_adj_list contains
also non-neighbour device links. It's really a small change visible,
currently, only for __netdev_adjacent_dev_insert/remove(), and doesn't
change the main linked logic at all.
Also, add some comments a fix a name collision in
netdev_for_each_upper_dev_rcu() and rework the naming by the following
rules:
netdev_(all_)(upper|lower)_*
If "all_" is present, then we work with the whole list of upper/lower
devices, otherwise - only with direct neighbours. Uninline functions - to
get better stack traces.
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes sparse warnings when incorrectly handling the port number
and using int instead of unsigned int iterating through &vn->sock_list[].
Keeping the port as __be16 also makes things clearer wrt endianess.
Also, it was pointed out that vxlan_get_rx_port() had unnecessary checks
which got removed.
Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Provide a kernel-doc comment documentation for the BQL helpers:
- netdev_sent_queue
- netdev_completed_queue
- netdev_reset_queue
Similarly to how it is done for the other functions, the documentation
only covers the function operating on struct net_device and not struct
netdev_queue.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds two more ndo ops: ndo_add_rx_vxlan_port() and
ndo_del_rx_vxlan_port().
Drivers can get notifications through the above functions about changes
of the UDP listening port of VXLAN. Also, when physical ports come up,
now they can call vxlan_get_rx_port() in order to obtain the port number(s)
of the existing VXLAN interface in case they already up before them.
This information about the listening UDP port would be used for VXLAN
related offloads.
A big thank you to John Fastabend (john.r.fastabend@intel.com) for his
input and his suggestions on this patch set.
CC: John Fastabend <john.r.fastabend@intel.com>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The new macro netdev_for_each_upper_dev_rcu(dev, upper, iter) iterates
through the dev->upper_dev_list starting from the first element, using
the netdev_upper_get_next_dev_rcu(dev, &iter).
Must be called under RCU read lock.
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds lower_dev_list list_head to net_device, which is the same
as upper_dev_list, only for lower devices, and begins to use it in the same
way as the upper list.
It also changes the way the whole adjacent device lists work - now they
contain *all* of upper/lower devices, not only the first level. The first
level devices are distinguished by the bool neighbour field in
netdev_adjacent, also added by this patch.
There are cases when a device can be added several times to the adjacent
list, the simplest would be:
/---- eth0.10 ---\
eth0- --- bond0
\---- eth0.20 ---/
where both bond0 and eth0 'see' each other in the adjacent lists two times.
To avoid duplication of netdev_adjacent structures ref_nr is being kept as
the number of times the device was added to the list.
The 'full view' is achieved by adding, on link creation, all of the
upper_dev's upper_dev_list devices as upper devices to all of the
lower_dev's lower_dev_list devices (and to the lower_dev itself), and vice
versa. On unlink they are removed using the same logic.
I've tested it with thousands vlans/bonds/bridges, everything works ok and
no observable lags even on a huge number of interfaces.
Memory footprint for 128 devices interconnected with each other via both
upper and lower (which is impossible, but for the comparison) lists would be:
128*128*2*sizeof(netdev_adjacent) = 1.5MB
but in the real world we usualy have at most several devices with slaves
and a lot of vlans, so the footprint will be much lower.
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eliezer renames several *ll_poll to *busy_poll, but forgets
CONFIG_NET_LL_RX_POLL, so in case of confusion, rename it too.
Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a ndo for getting physical port of the device. Driver
which is aware of being virtual function of some physical port should
implement this ndo. This is applicable not only for IOV, but for other
solutions (NPAR, multichannel) as well. Basically if there is possible
to have multiple netdevs on the single hw port.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No users outside net/core/dev.c.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Until now, bond_resend_igmp_join_requests() looks for vlans attached to
bonding device, bridge where bonding act as port manually. It does not
care of other scenarios, like stacked bonds or team device above. Make
this more generic and use netdev notifier to propagate the event to
upper devices and to actually call ip_mc_rejoin_groups().
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename ndo_ll_poll to ndo_busy_poll.
Rename sk_mark_ll to sk_mark_napi_id.
Rename skb_mark_ll to skb_mark_napi_id.
Correct all useres of these functions.
Update comments and defines in include/net/busy_poll.h
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/freescale/fec_main.c
drivers/net/ethernet/renesas/sh_eth.c
net/ipv4/gre.c
The GRE conflict is between a bug fix (kfree_skb --> kfree_skb_list)
and the splitting of the gre.c code into seperate files.
The FEC conflict was two sets of changes adding ethtool support code
in an "!CONFIG_M5272" CPP protected block.
Finally the sh_eth.c conflict was between one commit add bits set
in the .eesr_err_check mask whilst another commit removed the
.tx_error_check member and assignments.
Signed-off-by: David S. Miller <davem@davemloft.net>
When the kernel (compiled with CONFIG_PREEMPT=n) is performing the
rename of a network interface, it can end up waiting for a workqueue
to complete. If userland is able to invoke a SIOCGIFNAME ioctl or a
SO_BINDTODEVICE getsockopt in between, the kernel will deadlock due to
the fact that read_secklock_begin() will spin forever waiting for the
writer process (the one doing the interface rename) to update the
devnet_rename_seq sequence.
This patch fixes the problem by adding a helper (netdev_get_name())
and using it in the code handling the SIOCGIFNAME ioctl and
SO_BINDTODEVICE setsockopt.
The netdev_get_name() helper uses raw_seqcount_begin() to avoid
spinning forever, waiting for devnet_rename_seq->sequence to become
even. cond_resched() is used in the contended case, before retrying
the access to give the writer process a chance to finish.
The use of raw_seqcount_begin() will incur some unneeded work in the
reader process in the contended case, but this is better than
deadlocking the system.
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add netlink directives and ndo entry to allow for controling
VF link, which can be in one of three states:
Auto - VF link state reflects the PF link state (default)
Up - VF link state is up, traffic from VF to VF works even if
the actual PF link is down
Down - VF link state is down, no traffic from/to this VF, can be of
use while configuring the VF
Signed-off-by: Rony Efraim <ronye@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Caught by sparse:
- __rcu: missing annotation to sd->flow_limit
- __user: direct access in cpumask_scnprintf
Also
- add endline character when printing bitmap if room in buffer
- avoid bucket overflow by reducing FLOW_LIMIT_HISTORY
The last item warrants some explanation. The hashtable buckets are
subject to overflow if FLOW_LIMIT_HISTORY is larger than or equal
to bucket size, since all packets may end up in a single bucket. The
current (rather arbitrary) history value of 256 happens to match the
buffer size (u8).
As a result, with a single flow, the first 128 packets are accepted
(correct), the second 128 packets dropped (correct) and then the
history[] array has filled, so that each subsequent new packet
causes an increment in the bucket for new_flow plus a decrement
for old_flow: a steady state.
This is fine if packets are dropped, as the steady state goes away
as soon as a mix of traffic reappears. But, because the 256th packet
overflowed the bucket to 0: no packets are dropped.
Instead of explicitly adding an overflow check, this patch changes
FLOW_LIMIT_HISTORY to never be able to overflow a single bucket.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
(first item)
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds an ndo_ll_poll method and the code that supports it.
This method can be used by low latency applications to busy-poll
Ethernet device queues directly from the socket code.
sysctl_net_ll_poll controls how many microseconds to poll.
Default is zero (disabled).
Individual protocol support will be added by subsequent patches.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds a napi_id and a hashing mechanism to lookup a napi by id.
This will be used by subsequent patches to implement low latency
Ethernet device polling.
Based on a code sample by Eric Dumazet.
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 351638e7de (net: pass info struct via netdevice notifier)
breaks booting of my KVM guest, this is due to we still forget to pass
struct netdev_notifier_info in several places. This patch completes it.
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use new netdevice notifier infrastructure to pass along changed flags.
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
v2->v3: shortened notifier_info struct name
Signed-off-by: David S. Miller <davem@davemloft.net>
So far, only net_device * could be passed along with netdevice notifier
event. This patch provides a possibility to pass custom structure
able to provide info that event listener needs to know.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
v2->v3: fix typo on simeth
shortened dev_getter
shortened notifier_info struct name
v1->v2: fix notifier_call parameter in call_netdevice_notifier()
Signed-off-by: David S. Miller <davem@davemloft.net>
In the case where a non-MPLS packet is received and an MPLS stack is
added it may well be the case that the original skb is GSO but the
NIC used for transmit does not support GSO of MPLS packets.
The aim of this code is to provide GSO in software for MPLS packets
whose skbs are GSO.
SKB Usage:
When an implementation adds an MPLS stack to a non-MPLS packet it should do
the following to skb metadata:
* Set skb->inner_protocol to the old non-MPLS ethertype of the packet.
skb->inner_protocol is added by this patch.
* Set skb->protocol to the new MPLS ethertype of the packet.
* Set skb->network_header to correspond to the
end of the L3 header, including the MPLS label stack.
I have posted a patch, "[PATCH v3.29] datapath: Add basic MPLS support to
kernel" which adds MPLS support to the kernel datapath of Open vSwtich.
That patch sets the above requirements in datapath/actions.c:push_mpls()
and was used to exercise this code. The datapath patch is against the Open
vSwtich tree but it is intended that it be added to the Open vSwtich code
present in the mainline Linux kernel at some point.
Features:
I believe that the approach that I have taken is at least partially
consistent with the handling of other protocols. Jesse, I understand that
you have some ideas here. I am more than happy to change my implementation.
This patch adds dev->mpls_features which may be used by devices
to advertise features supported for MPLS packets.
A new NETIF_F_MPLS_GSO feature is added for devices which support
hardware MPLS GSO offload. Currently no devices support this
and MPLS GSO always falls back to software.
Alternate Implementation:
One possible alternate implementation is to teach netif_skb_features()
and skb_network_protocol() about MPLS, in a similar way to their
understanding of VLANs. I believe this would avoid the need
for net/mpls/mpls_gso.c and in particular the calls to
__skb_push() and __skb_push() in mpls_gso_segment().
I have decided on the implementation in this patch as it should
not introduce any overhead in the case where mpls_gso is not compiled
into the kernel or inserted as a module.
MPLS GSO suggested by Jesse Gross.
Based in part on "v4 GRE: Add TCP segmentation offload for GRE"
by Pravin B Shelar.
Cc: Jesse Gross <jesse@nicira.com>
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now when upper device is changed, event is not propagated via RT Netlink
to userspace. Userspace might never now about the change. Fix this by
adding upper-device-change notifier event.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge net into net-next because some upcoming net-next changes
build on top of bug fixes that went into net.
Signed-off-by: David S. Miller <davem@davemloft.net>
A cpu executing the network receive path sheds packets when its input
queue grows to netdev_max_backlog. A single high rate flow (such as a
spoofed source DoS) can exceed a single cpu processing rate and will
degrade throughput of other flows hashed onto the same cpu.
This patch adds a more fine grained hashtable. If the netdev backlog
is above a threshold, IRQ cpus track the ratio of total traffic of
each flow (using 4096 buckets, configurable). The ratio is measured
by counting the number of packets per flow over the last 256 packets
from the source cpu. Any flow that occupies a large fraction of this
(set at 50%) will see packet drop while above the threshold.
Tested:
Setup is a muli-threaded UDP echo server with network rx IRQ on cpu0,
kernel receive (RPS) on cpu0 and application threads on cpus 2--7
each handling 20k req/s. Throughput halves when hit with a 400 kpps
antagonist storm. With this patch applied, antagonist overload is
dropped and the server processes its complete load.
The patch is effective when kernel receive processing is the
bottleneck. The above RPS scenario is a extreme, but the same is
reached with RFS and sufficient kernel processing (iptables, packet
socket tap, ..).
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In some situations, we need to disable TSO on bonding slaves.
bonding device automatically unset TSO in bond_fix_features(), and
performance is not good because :
1) We consume more cpu cycles.
2) GSO segmentation has some bugs leading to out of order TCP packets
if this segmentation is done before virtual device. This particular
problem will be addressed in a separate patch.
This patch allows TSO being set/unset on the bonding master,
so that GSO segmentation is done after bonding layer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Michał Mirosław <mirqus@gmail.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The same story as with fib_trie patch - vfree() from RCU callbacks
is legitimate now.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the rx_{add,kill}_vid callbacks to take a protocol argument in
preparation of 802.1ad support. The protocol argument used so far is
always htons(ETH_P_8021Q).
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename the hardware VLAN acceleration features to include "CTAG" to indicate
that they only support CTAGs. Follow up patches will introduce 802.1ad
server provider tagging (STAGs) and require the distinction for hardware not
supporting acclerating both.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>