Commit Graph

377381 Commits

Author SHA1 Message Date
Yuchung Cheng 85f16525a2 tcp: properly send new data in fast recovery in first RTT
Linux sends new unset data during disorder and recovery state if all
(suspected) lost packets have been retransmitted ( RFC5681, section
3.2 step 1 & 2, RFC3517 section 4, NexSeg() Rule 2).  One requirement
is to keep the receive window about twice the estimated sender's
congestion window (tcp_rcv_space_adjust()), assuming the fast
retransmits repair the losses in the next round trip.

But currently it's not the case on the first round trip in either
normal or Fast Open connection, beucase the initial receive window
is identical to (expected) sender's initial congestion window. The
fix is to double it.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 02:46:29 -07:00
Sergei Shtylyov 8e994402ad sh_eth: remove '__maybe_unused' annotations
Now that  the SoC specific support is no longer done with help of #ifdef'fery,
we  no longer need '__maybe_unused' annotations to sh_eth_select_mii() and
sh_eth_set_duplex()...

Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 02:45:19 -07:00
Joe Perches fe2c6338fd net: Convert uses of typedef ctl_table to struct ctl_table
Reduce the uses of this unnecessary typedef.

Done via perl script:

$ git grep --name-only -w ctl_table net | \
  xargs perl -p -i -e '\
	sub trim { my ($local) = @_; $local =~ s/(^\s+|\s+$)//g; return $local; } \
        s/\b(?<!struct\s)ctl_table\b(\s*\*\s*|\s+\w+)/"struct ctl_table " . trim($1)/ge'

Reflow the modified lines that now exceed 80 columns.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 02:36:09 -07:00
Flavio Leitner 194f4a6df2 net: make all team port device link events urgent
Since team functionality relies heavily on userspace daemon, we need to
deliver event to userspace via Netlink as quick as possible. So make all
team port device link events urgent.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 02:31:41 -07:00
Wu Fengguang a06a2d378d net: ping_check_bind_addr() etc. can be static
net/ipv4/ping.c:286:5: sparse: symbol 'ping_check_bind_addr' was not declared. Should it be static?
net/ipv4/ping.c:355:6: sparse: symbol 'ping_set_saddr' was not declared. Should it be static?
net/ipv4/ping.c:370:6: sparse: symbol 'ping_clear_saddr' was not declared. Should it be static?

net/ipv6/ping.c:60:5: sparse: symbol 'dummy_ipv6_recv_error' was not declared. Should it be static?
net/ipv6/ping.c:64:5: sparse: symbol 'dummy_ip6_datagram_recv_ctl' was not declared. Should it be static?
net/ipv6/ping.c:69:5: sparse: symbol 'dummy_icmpv6_err_convert' was not declared. Should it be static?
net/ipv6/ping.c:73:6: sparse: symbol 'dummy_ipv6_icmp_error' was not declared. Should it be static?
net/ipv6/ping.c:75:5: sparse: symbol 'dummy_ipv6_chk_addr' was not declared. Should it be static?
net/ipv6/ping.c:201:5: sparse: symbol 'ping_v6_seq_show' was not declared. Should it be static?

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 01:36:41 -07:00
Ben Hutchings 8c367fcbe6 cxgb4: Do not set net_device::dev_id to VI index
net_device::dev_id should not be used merely to indicate a VI index,
as it affects the way the local part of IPv6 addresses is normally
generated.

This field was intended for use where multiple devices may share a
single assigned MAC address and need to have different IPv6 addresses.
T4 VIs each have their own MAC address.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Dimitris Michailidis <dm@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 01:24:17 -07:00
Jason Wang f57855a54f macvtap: fix uninitialized return value macvtap_ioctl_set_queue()
Return -EINVAL on illegal flag instead of uninitialized value. This fixes the
kbuild test warning.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 01:23:11 -07:00
Jason Wang d9a90a3105 macvtap: slient sparse warnings
This patch silents the following sparse warnings:

drivers/net/macvtap.c:98:9: warning: incorrect type in assignment (different
address spaces)
drivers/net/macvtap.c:98:9:    expected struct macvtap_queue *<noident>
drivers/net/macvtap.c:98:9:    got struct macvtap_queue [noderef]
<asn:4>*<noident>
drivers/net/macvtap.c:120:9: warning: incorrect type in assignment (different
address spaces)
drivers/net/macvtap.c:120:9:    expected struct macvtap_queue *<noident>
drivers/net/macvtap.c:120:9:    got struct macvtap_queue [noderef]
<asn:4>*<noident>
drivers/net/macvtap.c:151:22: error: incompatible types in comparison expression
(different address spaces)
drivers/net/macvtap.c:233:23: error: incompatible types in comparison expression
(different address spaces)
drivers/net/macvtap.c:243:23: error: incompatible types in comparison expression
(different address spaces)
drivers/net/macvtap.c:247:15: error: incompatible types in comparison expression
(different address spaces)
  CC [M]  drivers/net/macvtap.o
drivers/net/macvlan.c:232:24: error: incompatible types in comparison expression
(different address spaces)

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 01:23:11 -07:00
Simon Horman 2c928e0e8d sctp: Correct byte order of access to skb->{network, transport}_header
Corrects an byte order conflict introduced by
158874cac6
("sctp: Correct access to skb->{network, transport}_header").
The values in question are host byte order.

Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 01:21:50 -07:00
Gao feng ca15febfe9 netlink: make compare exist all the time
Commit da12c90e09
"netlink: Add compare function for netlink_table"
only set compare at the time we create kernel netlink,
and reset compare to NULL at the time we finially
release netlink socket, but netlink_lookup wants
the compare exist always.

So we should set compare after we allocate nl_table,
and never reset it. make comapre exist all the time.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-13 00:45:48 -07:00
Cong Wang e3d73bcedf net: add doc for ip_early_demux sysctl
commit 6648bd7e0e (ipv4: Add sysctl knob to control
early socket demux) introduced such sysctl, but forgot to add
doc into Documentation/networking/ip-sysctl.txt. This patch adds it.

Basically I grab the doc from the description of commit 41063e9dd1
(ipv4: Early TCP socket demux.) and the above commit.

Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 15:10:13 -07:00
Pavel Emelyanov 944a1376b6 tun: Turn tun_flow_init() into void fn
This routine doesn't fail since 9fdc6bef (tuntap: dont use a private kmem_cache)
so it makes sense to compact the code a little bit.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 15:08:14 -07:00
Pavel Emelyanov 274038f8c9 tun: Report "persist" flag to userspace
The TUN_PERSIST flag is not reported at all -- both TUNGETIFF, and sysfs
"flags" attribute skip one. Knowing whether a device is persistent or not
is critical for checkpoint-restore, thus I propose to add the read-only
IFF_PERSIST one for this.

Setting this new IFF_PERSIST is hardly possible, as TUNSETIFF doesn't check
for unknown flags being zero and thus there can be trash.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 15:07:21 -07:00
Eric Dumazet 7c0cadc69c udp: fix two sparse errors
commit ba418fa357 ("soreuseport: UDP/IPv4 implementation")
added following sparse errors :

net/ipv4/udp.c:433:60: warning: cast from restricted __be16
net/ipv4/udp.c:433:60: warning: incorrect type in argument 1 (different base types)
net/ipv4/udp.c:433:60:    expected unsigned short [unsigned] [usertype] val
net/ipv4/udp.c:433:60:    got restricted __be16 [usertype] sport
net/ipv4/udp.c:433:60: warning: cast from restricted __be16
net/ipv4/udp.c:433:60: warning: cast from restricted __be16
net/ipv4/udp.c:514:60: warning: cast from restricted __be16
net/ipv4/udp.c:514:60: warning: incorrect type in argument 1 (different base types)
net/ipv4/udp.c:514:60:    expected unsigned short [unsigned] [usertype] val
net/ipv4/udp.c:514:60:    got restricted __be16 [usertype] sport
net/ipv4/udp.c:514:60: warning: cast from restricted __be16
net/ipv4/udp.c:514:60: warning: cast from restricted __be16

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 15:03:24 -07:00
Eric Dumazet 5b9b626377 gro: remove a sparse error
Fix following sparse error :

net/ipv4/af_inet.c:1410:59: warning: restricted __be16 degrades to
integer

added in commit db8caf3dbc
("gro: should aggregate frames without DF")

Reported-by: kbuild test robot <fengguang.wu@intel.com>
From: Eric Dumazet <edumazet@google.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 15:03:24 -07:00
David S. Miller 4a2e667ac1 Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next
John W. Linville says:

====================
This pull request is intended for the 3.11 stream...

One big highlight is the cw1200 driver the ST-E CW1100 & CW1200
WLAN chipsets.  This one has been lingering for a while, lacking
some review comments.  Once started getting pulled into linux-next,
it got a bit more attention and a number of improvements were made
over the initial cut.  No doubt there will be more changes ahead,
but I think it is looking alright at this point.

Along with that, there is the usual flurry of updates to the mac80211
core and the iwlwifi, mwifiex, ath9k, rt2x00, wil6210, and other
drivers.  A few of the highlights are some rt2x00 refactoring/cleanup
by Gabor Juhos, some rt2800 hardware support enhancements by Stanislaw
Gruszka, some iwlwifi power management updates from Alexander Bondar,
some enhanced bcma SPROM support from Rafał Miłecki, and a variety
of other things here and there.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 14:23:41 -07:00
Sergei Shtylyov 8f728d7934 sh_eth: split 'sh_eth_netdev_ops'
Commit 9f86134155 (sh_eth: remove SH_ETH_HAS_TSU)
removes 'const' from 'sh_eth_netdev_ops'  and modifies it in case TSU registers
are present. I've originally suggested to Iwamatsu-san to split  this structure
in two instead and afterwards Dave M. suggested doing the same.
Split 'sh_eth_netdev_ops_tsu' from 'sh_eth_netdev_ops', making both 'const', and
assigning 'ndev->detdev_ops'  depending on the presence of TSU registers.

Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 14:15:44 -07:00
Eric Dumazet c70eba7453 igmp: fix new sparse errors
Fix following sparse errors :

net/ipv4/igmp.c:1222:25: warning: cast from restricted __be32
net/ipv4/igmp.c🔢31: warning: incorrect type in assignment (different address spaces)
net/ipv4/igmp.c🔢31:    expected struct ip_mc_list [noderef] <asn:4>*next_hash
net/ipv4/igmp.c🔢31:    got struct ip_mc_list *<noident>
net/ipv4/igmp.c:1250:31: warning: incorrect type in assignment (different address spaces)
net/ipv4/igmp.c:1250:31:    expected struct ip_mc_list [noderef] <asn:4>*next_hash
net/ipv4/igmp.c:1250:31:    got struct ip_mc_list *<noident>
net/ipv4/igmp.c:2380:37: warning: cast from restricted __be32

These were added by commit e989707135
("igmp: hash a hash table to speedup ip_check_mc_rcu()")

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 14:14:55 -07:00
Claudiu Manoil 5eaedf3131 gianfar: Add backwards compatible Single Queue mode polling
Older Single Queue (SQ_SG_MODE) devices like TSEC (i.e. mpc83xx)
don't feature the frame receive indication bits (RXF) in RSTAT.
For these and for the rest of the SQ_SG_MODE devices, provide the
appropiate polling routine that handles a single pair of Rx/Tx
BD rings, removing the overhead incurred by the multiple queues/
multiple interrupt group devices (veTSEC/ eTSEC2.0 devices).
So this is primarily a fix for the TSEC devices.

Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 03:16:20 -07:00
Ben Hutchings 6602041b83 sfc: Store port number in private data, not net_device::dev_id
We should not use net_device::dev_id to indicate the port number, as
this affects the way the local part of IPv6 addresses is normally
generated.

This field was intended for use where multiple devices may share a
single assigned MAC address and need to have different IPv6 addresses.
Siena's two ports each have their own MAC addresses.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 03:15:02 -07:00
Rami Rosen 503b47eafc ipv4: remove is_data also from ip_options documentation.
commit ef722495c8
( [IPV4]: Remove unused ip_options->is_data) removed the unused is_data
member from ip_options struct.

This patch removes is_data also from the documentation of the ip_options struct.

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 03:13:50 -07:00
Jiri Pirko 735d381fa5 team: remove synchronize_rcu() called during port disable
Check the unlikely case of team->en_port_count == 0 before modulo
operation.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 03:06:16 -07:00
Jiri Pirko d80b35beac team: use kfree_rcu instead of synchronize_rcu in team_port_dev
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 03:05:54 -07:00
Jiri Pirko 6c31ff366c team: remove synchronize_rcu() called during queue override change
This patch removes synchronize_rcu() from function
__team_queue_override_port_del(). That can be done because it is ok to
do list_del_rcu() and list_add_tail_rcu() on the same list_head member
without calling synchronize_rcu() in between. A bit of refactoring
needed to be done because INIT_LIST_HEAD needed to be removed (to not
kill the forward pointer) as well.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 03:05:53 -07:00
Narendra K dffebd2c5c doc:networking: Update comment for dev_id field in netdevice.h
This patch updates the comment for 'dev_id' field in
'include/linux/netdevice.h' to reflect the intended
usage of 'dev_id'.

References: http://marc.info/?l=linux-netdev&m=136992115300526&w=2
References: http://marc.info/?l=linux-netdev&m=137062569014612&w=2

Signed-off-by: Narendra K <narendra_k@dell.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 03:03:47 -07:00
Tushar Behera 32766fff24 net: can: Convert to use devm_ioremap_resource
Commit 75096579c3 ("lib: devres: Introduce devm_ioremap_resource()")
introduced devm_ioremap_resource() and deprecated the use of
devm_request_and_ioremap().

Signed-off-by: Tushar Behera <tushar.behera@linaro.org>
CC: netdev@vger.kernel.org
CC: linux-can@vger.kernel.org
CC: Marc Kleine-Budde <mkl@pengutronix.de>
CC: Wolfgang Grandegger <wg@grandegger.com>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 02:22:35 -07:00
Tushar Behera eed5d29d78 net: emaclite: Convert to use devm_ioremap_resource
Commit 75096579c3 ("lib: devres: Introduce devm_ioremap_resource()")
introduced devm_ioremap_resource() and deprecated the use of
devm_request_and_ioremap().

Signed-off-by: Tushar Behera <tushar.behera@linaro.org>
CC: netdev@vger.kernel.org
CC: "David S. Miller" <davem@davemloft.net>
CC: Michal Simek <michal.simek@xilinx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 02:22:35 -07:00
Tushar Behera 941e173a53 net: fec: Convert to use devm_ioremap_resource
Commit 75096579c3 ("lib: devres: Introduce devm_ioremap_resource()")
introduced devm_ioremap_resource() and deprecated the use of
devm_request_and_ioremap().

Signed-off-by: Tushar Behera <tushar.behera@linaro.org>
CC: netdev@vger.kernel.org
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 02:22:35 -07:00
Sergei Shtylyov afd6eae13c 3c59x: consolidate error cleanup in vortex_init_one()
The PCI driver's probe() method  duplicates the error cleanup code each time it
has to do error exit. Consolidate the error cleanup code  in  one place and use
*goto* to jump to the right places.

Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Acked-by: Steffen Klassert <klassert@mathematik.tu-chemnitz.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 01:53:52 -07:00
Hong zhi guo 3f8b96379a veth: remove redundant call of dev_alloc_name
it's called in the following register_netdevice. No need to call it
here.
Tested with "ip link add type veth" and "ip link add xxx%d type veth".

Signed-off-by: Hong Zhiguo <honkiko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 01:21:20 -07:00
Daniel Borkmann 7a6e288d27 pktgen: ipv6: numa: consolidate skb allocation to pktgen_alloc_skb
We currently allow for numa-node aware skb allocation only within the
fill_packet_ipv4() path, but not in fill_packet_ipv6(). Consolidate that
code to a common allocation helper to enable numa-node aware skb
allocation for ipv6, and use it in both paths. This also makes both
functions a bit more readable.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 00:47:25 -07:00
Daniel Borkmann da5bab079f net: udp4: move GSO functions to udp_offload
Similarly to TCP offloading and UDPv6 offloading, move all related
UDPv4 functions to udp_offload.c to make things more explicit. Also,
by this, we can make those functions static.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 00:47:25 -07:00
Shawn Bohrer 946d3bd723 igmp: remove unnecessary in_device member zeroing
ip_mc_init_dev() is passed a freshly kzalloc'd in_device so it is
unnecessary to explicitly zero out the members.

Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 00:41:15 -07:00
Eric Dumazet e989707135 igmp: hash a hash table to speedup ip_check_mc_rcu()
After IP route cache removal, multicast applications using
a lot of multicast addresses hit a O(N) behavior in ip_check_mc_rcu()

Add a per in_device hash table to get faster lookup.

This hash table is created only if the number of items in mc_list is
above 4.

Reported-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 00:25:23 -07:00
Eric Dumazet 64153ce0a7 net_sched: htb: do not setup default rate estimators
With a thousand htb classes, est_timer() spends ~5 million cpu cycles
and throws out cpu cache, because each htb class has a default
rate estimator (est 4sec 16sec).

Most users do not use default rate estimators, so switch htb
to not setup ones.

Add a module parameter (htb_rate_est) so that users relying
on this default rate estimator can revert the behavior.

echo 1 >/sys/module/sch_htb/parameters/htb_rate_est

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 00:14:21 -07:00
Eric Dumazet 130d3d68b5 net_sched: psched_ratecfg_precompute() improvements
Before allowing 64bits bytes rates, refactor
psched_ratecfg_precompute() to get better comments
and increased accuracy.

rate_bps field is renamed to rate_bytes_ps, as we only
have to worry about bytes per second.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11 22:39:47 -07:00
John W. Linville 3899ba90a4 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem
Conflicts:
	drivers/net/wireless/ath/ath9k/debug.c
	net/mac80211/iface.c
2013-06-11 14:48:32 -04:00
Solomon Peachy 8b3e7be437 cw1200: Fix an assorted pile of checkpatch warnings.
Signed-off-by: Solomon Peachy <pizza@shaftnet.org>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2013-06-11 12:48:10 -04:00
Solomon Peachy 19db577868 cw1200: Eliminate the ETF debug/engineering code.
This is only really useful for people who are bringing up new hardware
designs and have access to the proprietary vendor tools that interface
with this mode.

It'll live out of tree until it's rewritten to use a less kludgy interface.

Signed-off-by: Solomon Peachy <pizza@shaftnet.org>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2013-06-11 12:48:10 -04:00
Solomon Peachy fa8eeae102 cw1200: Remove "ITP" debug subsystem.
This can live on as an out-of-tree patch for those that care.

Signed-off-by: Solomon Peachy <pizza@shaftnet.org>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2013-06-11 12:48:09 -04:00
Eric Dumazet 45203a3b38 net_sched: add 64bit rate estimators
struct gnet_stats_rate_est contains u32 fields, so the bytes per second
field can wrap at 34360Mbit.

Add a new gnet_stats_rate_est64 structure to get 64bit bps/pps fields,
and switch the kernel to use this structure natively.

This structure is dumped to user space as a new attribute :

TCA_STATS_RATE_EST64

Old tc command will now display the capped bps (to 34360Mbit), instead
of wrapped values, and updated tc command will display correct
information.

Old tc command output, after patch :

eric:~# tc -s -d qd sh dev lo
qdisc pfifo 8001: root refcnt 2 limit 1000p
 Sent 80868245400 bytes 1978837 pkt (dropped 0, overlimits 0 requeues 0)
 rate 34360Mbit 189696pps backlog 0b 0p requeues 0

This patch carefully reorganizes "struct Qdisc" layout to get optimal
performance on SMP.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11 02:51:03 -07:00
Peter Pan(潘卫平) b41abb42bf net: pass correct parameter to skb_headers_offset_update()
Since commit 1a37e412a022(net: Use 16bits for *_headers fields of struct
skbuff), skb->*_header are relative to skb->head,
so copy_skb_header() should not call skb_headers_offset_update() now,
and we should pass correct parameter to skb_headers_offset_update() in
pskb_expand_head() and skb_copy_expand().

Signed-off-by: Weiping Pan <panweiping3@gmail.com>
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11 02:48:06 -07:00
Gao feng da12c90e09 netlink: Add compare function for netlink_table
As we know, netlink sockets are private resource of
net namespace, they can communicate with each other
only when they in the same net namespace. this works
well until we try to add namespace support for other
subsystems which use netlink.

Don't like ipv4 and route table.., it is not suited to
make these subsytems belong to net namespace, Such as
audit and crypto subsystems,they are more suitable to
user namespace.

So we must have the ability to make the netlink sockets
in same user namespace can communicate with each other.

This patch adds a new function pointer "compare" for
netlink_table, we can decide if the netlink sockets can
communicate with each other through this netlink_table
self-defined compare function.

The behavior isn't changed if we don't provide the compare
function for netlink_table.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11 02:39:42 -07:00
Li RongQing 8249152c47 xen-netfront: use skb_partial_csum_set() to simplify the codes
use skb_partial_csum_set() to simplify the codes

Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11 02:34:36 -07:00
David S. Miller 2e42206948 Merge branch 'bridge_flags'
Vlad Yasevich says:

====================
The following series adds 2 new flags to bridge.  One flag allows
the user to control whether mac learning is performed on the interface
or not.  By default mac learning is on.
The other flag allows the user to control whether unicast traffic
is flooded (send without an fdb) to a given unicast port.  Default is
on.

Changes since v4:
 - Implemented Stephen's suggestions.

Changes since v2:
 - removed unused "unlock" tag.

Changes since v1:
 - Integrated suggestion from MST to not impact RTM_NEWNEIGH and to
   skip lookups when learning is disabled.

Vlad Yasevich (2):
  bridge: Add flag to control mac learning.
  bridge: Add a flag to control unicast packet flood.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11 02:04:43 -07:00
Vlad Yasevich 867a59436f bridge: Add a flag to control unicast packet flood.
Add a flag to control flood of unicast traffic.  By default, flood is
on and the bridge will flood unicast traffic if it doesn't know
the destination.  When the flag is turned off, unicast traffic
without an FDB will not be forwarded to the specified port.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11 02:04:32 -07:00
Vlad Yasevich 9ba18891f7 bridge: Add flag to control mac learning.
Allow user to control whether mac learning is enabled on the port.
By default, mac learning is enabled.  Disabling mac learning will
cause new dynamic FDB entries to not be created for a particular port.

Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-11 02:04:32 -07:00
Cong Wang 30f3a40f9a net: remove last caller of skb_tail_offset() and itself
Similar to the following commits:

commit 00f97da17a (netpoll: fix position of network header)
commit 525cebedb3 (pktgen: Fix position of ip and udp header)

using skb_tail_offset() seems not correct since the offset
is based on head pointer.

With the last caller removed, skb_tail_offset() can be killed
finally.

Cc: Thomas Graf <tgraf@suug.ch>
Cc: Daniel Borkmann <dborkmann@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-10 22:22:23 -07:00
David S. Miller 0a4db187a9 Merge branch 'll_poll'
Eliezer Tamir says:

====================
This patch set adds the ability for the socket layer code to
poll directly on an Ethernet device's RX queue.
This eliminates the cost of the interrupt and context switch
and with proper tuning allows us to get very close to the HW latency.

This is a follow up to Jesse Brandeburg's Kernel Plumbers talk from
last year
http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/09/2012-lpc-Low-Latency-Sockets-slides-brandeburg.pdf

Patch 1 adds a napi_id and a hashing mechanism to lookup a napi by id.
Patch 2 adds an ndo_ll_poll method and the code that supports it.
Patch 3 adds support for busy-polling on UDP sockets.
Patch 4 adds support for TCP.
Patch 5 adds the ixgbe driver code implementing ndo_ll_poll.
Patch 6 adds additional statistics to the ixgbe driver for ndo_ll_poll.

Performance numbers:
     setup                         TCP_RR           UDP_RR
kernel  Config     C3/6 rx-usecs tps cpu% S.dem  tps cpu% S.dem
patched optimized  on   100      87k 3.13 11.4   94K 3.17 10.7
patched optimized  on   0        71k 3.12 14.0   84k 3.19 12.0
patched optimized  on   adaptive 80k 3.13 12.5   90k 3.46 12.2
patched typical    on   100      72  3.13 14.0   79k 3.17 12.8
patched typical    on   0        60k 2.13 16.5   71k 3.18 14.0
patched typical    on   adaptive 67k 3.51 16.7   75k 3.36 14.5
3.9     optimized  on   adaptive 25k 1.0  12.7   28k 0.98 11.2
3.9     typical    off  0        48k 1.09  7.3   52k 1.11 4.18
3.9     typical    0ff  adaptive 35k 1.12 4.08   38k 0.65 5.49
3.9     optimized  off  adaptive 40k 0.82 4.83   43k 0.70 5.23
3.9     optimized  off  0        57k 1.17 4.08   62k 1.04 3.95

Test setup details:
Machines: each with two Intel Xeon 2680 CPUs and X520 (82599) optical
NICs
Tests: Netperf tcp_rr and udp_rr, 1 byte (round trips per second)
Kernel: unmodified 3.9 and patched 3.9
Config: typical is derived from RH6.2, optimized is a stripped down
config.
Interrupt coalescing (ethtool rx-usecs) settings: 0=off, 1=adaptive,
100 us
When C3/6 states were turned on (via BIOS) the performance governor
was used.

These performance numbers were measured with v2 of the patch set.
Performance of the optimized config with an rx-usecs setting of 100
(the first line in the table above) was tracked during the evolution
of the patches and has never varied by more than 1%.

Design:
A global hash table that allows us to look up a struct napi by a
unique id was added.

A napi_id field was added both to struct sk_buff and struct sk.
This is used to track which NAPI we need to poll for a specific
socket.

The device driver marks every incoming skb with this id.
This is propagated to the sk when the socket is looked up in the
protocol handler.

When the socket code does not find any more data on the socket queue,
it now may call ndo_ll_poll which will crank the device's rx queue and
feed incoming packets to the stack directly from the context of the
socket.

A sysctl value (net.core4.low_latency_poll) controls how many
microseconds we busy-wait before giving up. (setting to 0 globally
disables busy-polling)

Locking:

1. Locking between napi poll and ndo_ll_poll:
Since what needs to be locked between a device's NAPI poll and
ndo_ll_poll, is highly device / configuration dependent, we do this
inside the Ethernet driver.
For example, when packets for high priority connections are sent to
separate rx queues, you might not need locking between napi poll and
ndo_ll_poll at all.

For ixgbe we only lock the RX queue.
ndo_ll_poll does not touch the interrupt state or the TX queues.
(earlier versions of this patchset did touch them,
but this design is simpler and works better.)

If a queue is actively polled by a socket (on another CPU) napi poll
will not service it, but will wait until the queue can be locked
and cleaned before doing a napi_complete().
If a socket can't lock the queue because another CPU has it,
either from napi or from another socket polling on the queue,
the socket code can busy wait on the socket's skb queue.

Ndo_ll_poll does not have preferential treatment for the data from the
calling socket vs. data from others, so if another CPU is polling,
you will see your data on this socket's queue when it arrives.

Ndo_ll_poll is called with local BHs disabled, so it won't race on
the same CPU with net_rx_action, which calls the napi poll method.

2. Napi_hash
The napi hash mechanism uses RCU.
napi_by_id() must be called under rcu_read_lock().
After a call to napi_hash_del(), caller must take care to wait an rcu
grace period before freeing the memory containing the napi struct.
(Ixgbe already had this because the queue vector structure uses rcu to
protect the statistics counters in it.)

how to test:

1. The patchset should apply cleanly to net-next.
(don't forget to configure INET_LL_RX_POLL).

2. The ethtool -c setting for rx-usecs should be on the order of 100.

3. Use ethtool -K to disable GRO and LRO
(You are encouraged to try it both ways. If you find that your
workload
does better with GRO on do tell us.)

4. Sysctl value net.core.low_latency_poll controls how long
(in us) to busy-wait for more data, You are encouraged to play
with this and see what works for you. The default is now 0 so you need
to
set it to turn the feature on. I recommend a value around 50.

4. benchmark thread and IRQ should be bound to separate cores.
Both cores should be on the same CPU NUMA node as the NIC.
When the app and the IRQ run on the same CPU  you get a small penalty.
If interrupt coalescing is set to a low value this penalty can be very
large.

5. If you suspect that your machine is not configured properly,
use numademo to make sure that the CPU to memory BW is OK.
numademo 128m memcpy local copy numbers should be more than
8GB/s on a properly configured machine.

Change log:
v10
- removed select/poll support. (we will work on this some more and try again)
v9
- correct sysctl proc_handler, reported by Eric Dumazet and Amir Vadai.
- more int -> bool changes, reported by Eric Dumazet.
- better mask testing in sock_poll(), reported by Eric Dumazet.

v8
- split out udp and select/poll into separate patches.
  what used to be patch 2/5 is now three patches.
- type corrections from Amir Vadai and Cong Wang:
  one unsigned long that was left when changing to cycles_t
  int -> bool
- more detailed patch descriptions.

v7
- suggested by Ben Hutchings and Eric Dumazet:
  type fixes, static for globals in net/core.c,
  avoid napi_id collisions in napi_hash_add()

v6
- many small fixes suggested by Eric Dumazet:
  data locality, typos, documentation
  protect napi_hash insert/delete with a spinlock (napi_gen_id is no
  longer atomic_t since it's only accessed with the spinlock held.)
- added IPv6 TCP and UDP support (only minimally tested)

v5
- corrections suggested by Ben Hutchings:
  fixed typos, moved the config option and sysctl value from IPv4 to net
- moved sk_mark_ll() to the protocol handlers
- removed global id mechanism, replaced with a hashed napi_id.
  based on code sample from Eric Dumazet
  Note that ixgbe_free_q_vector() already waits an rcu grace period
  before freeing the q_vector, so nothing additional needs to be done
  when adding a call to napi_hash_del().
- simple poll/select support

v4
- removed separate config option for TCP as suggested Eric Dumazet.
- added linux mib counter for packets received through the low latency path,
  as suggested by Andi Kleen.
- re-allow module unloading, remove module param, use a global generation id
  instead to prevent the use of a stale napi pointer, as suggested
  by Eric Dumazet
- updated Documentation/networking/ip-sysctl.txt text

v3
- coding style changes suggested by Dave Miller

v2
- the sysctl knob is now in microseconds. The default value is now 0 (off).
- for now the code depends at configure time on CONFIG_I86_TSC
- the napi reference in struct skb is now a union with the dma cookie
  since the former is only used on RX and the latter on TX,
  as suggested by Eric Dumazet.
- we do a better job at honoring non-blocking operations.
- removed busy-polling support for tcp_read_sock()
- remove dynamic disabling of GRO
- coding style fixes
- disallow unloading the device module after the feature has been used

Credit:
Jesse Brandeburg, Arun Chekhov Ilango, Julie Cummings,
Alexander Duyck, Eric Geisler, Jason Neighbors, Yadong Li,
Mike Polehn, Anil Vasudevan, Don Wood
Special thanks for finding bugs in earlier versions:
Willem de Bruijn and Andi Kleen
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-10 21:23:57 -07:00
Eliezer Tamir 7e15b90ff9 ixgbe: add extra stats for ndo_ll_poll
Add additional statistics to the ixgbe driver for ndo_ll_poll
Defined under LL_EXTENDED_STATS

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-10 21:22:36 -07:00