OpenCloudOS-Kernel

Commit Graph

Author	SHA1	Message	Date
David S. Miller	4be3158abe	Merge branch 'mlxsw-spectrum' Jiri Pirko says: ==================== mlxsw: Driver update, add initial support for Spectrum ASIC Purpose of this patchset is to introduce initial support for Mellanox Spectrum ASIC, including L2 bridge forwarding offload. The only non-mlxsw patch in this patchset is the first one, introducing pre-change upper notifier. That is used in last patch to ensure ports of single ASIC are not bridged into multiple bridges, as that scenario is currently not supported by driver. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-16 07:15:31 -07:00
Jiri Pirko	56ade8fe3f	mlxsw: spectrum: Add initial support for Spectrum ASIC Add support for new generation Mellanox Spectrum ASIC, 10/25/40/50 and 100Gb/s Ethernet Switch. The initial driver implements bridge forwarding offload including bridge internal VLAN support, FDB static entries, FDB learning and HW ageing including their setup. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Elad Raz <eladr@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-16 07:15:23 -07:00
Ido Schimmel	a4feea74cd	mlxsw: reg: Add Switch Port VLAN MAC Learning register definition Since we currently do not support the offloading of 802.1D bridges, we need to be able to let the device know it should not learn MAC addresses on specific {Port, VID} pairs. Add the SPVMLR register, which controls the learning enablement of {Port, VID} pairs. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-16 07:15:22 -07:00
Jiri Pirko	e534a56a31	mlxsw: reg: Add Switch Filtering Database Aging Time register definition Add SFDAT which is used to control switch ageing time. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-16 07:15:22 -07:00
Ido Schimmel	1f65da742d	mlxsw: reg: Add Switch Virtual-Port Enabling register definition In order for a port to support {Port, VID} to FID mapping it needs to be configured to a virtual port mode (as opposed to VLAN mode). Add the SVPE register, which enables port virtualization. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-16 07:15:20 -07:00
Ido Schimmel	6479023976	mlxsw: reg: Add Switch VID to FID Allocation register definition An incoming packet can be classified into a filtering identifer (FID) based on its VID or incoming port and VID ({Port, VID}). Add the SVFA register, which controls this mapping. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-16 07:15:19 -07:00
Ido Schimmel	f1fb693a08	mlxsw: reg: Add Switch FID Management register definition Filtering identifiers (FIDs) are unique identifers of bridge instances in the hardware. Add the SFMR register, which is responsible for the creation and configuration of these FIDs. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-16 07:15:18 -07:00
Jiri Pirko	e059436999	mlxsw: reg: Add shared buffer configuration registers definitions Add definitions of SBPR, SBCM, SBPM, SBMM and PBMC registers that are used to configure shared buffers. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-16 07:15:18 -07:00
Elad Raz	b2e345f9a4	mlxsw: reg: Add Switch Port VID and Switch Port VLAN Membership registers definitions Add SPVID and SPVM registers responsible for default port VID configuration and VLAN membership of a port. Signed-off-by: Elad Raz <eladr@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-16 07:15:17 -07:00
Jiri Pirko	f5d88f5892	mlxsw: reg: Add Switch FDB Notification register definition Add SFN register which is used to poll for newly added and aged-out FDB entries. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-16 07:15:17 -07:00
Jiri Pirko	236033b33c	mlxsw: reg: Add Switch Filtering Database register definition Add the SFD register which is responsible for filtering database manipulation, including static and dynamic FDB entries. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-16 07:15:15 -07:00
Jiri Pirko	d64b159253	mlxsw: item: Add MLXSW_ITEM_BUF_INDEXED helper Add missing item helper which allows to access char bufs on multiple offsets. This is needed by SFD and SFN register definitions. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-16 07:15:14 -07:00
Jiri Pirko	7b0989b5bc	mlxsw: item: Make src arg of memcpy_to helper const Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-16 07:15:12 -07:00
Ido Schimmel	12fd35ab8a	mlxsw: cmd: Introduce FID-offset flooding tables Packets destined to offloaded netdevs will be classified to FIDs in the device and flooded in case of BUM. The flooding table used is of type FID-offset, which allows one to create different flooding domains for different FIDs and specify the offset in the flooding table for each FID (not necessarily equal to FID or VID). Add support for this flooding table type, by exposing the configuration of the number of tables from this type and their size. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-16 07:15:10 -07:00
Ido Schimmel	453b6a8dd8	mlxsw: cmd: Introduce per-FID flooding tables In the newly introduced Spectrum switch ASIC, packets destined to not offloaded netdevs will be classified to special FIDs (vFIDs) in the device and flooded to the CPU port. The flooding table used is of type per-FID, which allows one to create different flooding domains for different vFIDs. While using a simple single-entry flood table is certainly sufficient at this point, we do plan to offload 802.1D bridges involving VLAN interfaces, thus making this change necessary. Add support for this flooding table type, by exposing the configuration of the number of tables from this type and their size. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-16 07:15:08 -07:00
Ido Schimmel	bc2055f878	mlxsw: Enable configuration of flooding domains As part of the introduction of L2 offloads, allow different ports to join/leave the flooding domain, according to user configuration. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-16 07:15:08 -07:00
Jiri Pirko	573c7ba006	net: introduce pre-change upper device notifier This newly introduced netdevice notifier is called before actual change upper happens. That provides a possibility for notifier handlers to know upper change will happen and react to it, including possibility to forbid the change. That is valuable for drivers which can check if the upper device linkage is supported and forbid that in case it is not. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-16 07:15:05 -07:00
David S. Miller	125ecf4b59	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue Jeff Kirsher says: ==================== Intel Wired LAN Driver Updates 2015-10-16 This series contains updates to e1000, e1000e, igb, igbvf, ixgbe, ixgbevf, i40e, i40evf and fm10k. Alex Duyck fixes the polling routine for i40e/i40evf were the NAPI budget for receive cleanup was being rounded up to 1 but the netpoll call was expecting no Rx to be processed as the budget passed was 0. Also cleaned up IN_NETPOLL flag that was not adding any value due to the receive cleanup was handled in NAPI. Added support for netpoll for i40evf as well. Jesse updates all of our drivers to use napi_complete_done() instead of napi_complete(), which allows us to use /sys/class/net/ethX/gro_flush_timeout. Added ethtool support to control and report the new Interrupt Limit register, since the XL710 hardware has a different interrupt moderation design that can support a limit of total interrupts per second per vector. Shannon cleans up startup log entries to cut down the number by putting a couple behind debug flags and combining others into single line. Added support to enable/disable printing VEB statistics via ethtool. Jingjing fixes a compile issue by adding const to functions that return strings that are not going to be modified. Greg Rose cleans up defines that were not used and were causing customer confusion. Greg Bowers adds support for setting a new bit in the Set Local LLDP MIB admin queue command Type field. Mitch fixes an issue where vlan_features field was set to the same value as netdev features field, but before the features were actually being set up, leaving the vlan_features empty. Resolve the issue by setting up the netdev features first, then mask out the VLAN feature bits when assigning vlan_features. Fixed VF init timing, where in some instances the VFs would fail to initialize the first time you loaded the driver. To correct this, increased the delay time for the init task and wait longer before giving up. v2: fix missing space in function header comment in patch 3, based on feedback from Sergei Shtylyov. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-16 06:41:10 -07:00
Catherine Sullivan	d1d39516e4	i40e/i40evf: Bump i40e to 1.3.34 and i40evf to 1.3.21 Bump. Change-ID: I7ec818a507554648675b9b245ced9e6b6bd9ed4e Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2015-10-16 05:05:08 -07:00
Mitch Williams	628f096d8d	i40e: increase AQ work limit With 64 VFs, we can easily overwhelm the AQ on the PF if we have too low a limit on the number of AQ requests. This leads to ARQ overflow errors, and occasionally VFs that fail to initialize. Since we really only hit this condition on initial VF driver load, the requests that we process are lightweight, so this extra work doesn't cause problems for the PF driver. Change-ID: I620221520d8af987df6ace9ba938ffaf22107681 Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2015-10-16 05:02:40 -07:00
Mitch Williams	3f7e5c330e	i40evf: relax and stagger init timing a bit On some devices, in some systems, in some configurations, the VFs would fail to initialize the first time you loaded the driver. To correct this, increase the delay time for the init task slightly, and wait longer before giving up. If we enable VFs and load the VF driver in the same kernel as the PF driver, we can totally overwhelm the PF driver with AQ requests because all of the instances try to initialize at the same time. To help alleviate this, stagger the initial scheduling of the init task using the PCIe function as a multiplier. We mask off the function to only three bits so no instance has to wait too long. With these two changes, initializing 128 VFs on a single device goes from four minutes to just a few seconds. Change-ID: If3d8720c1c4e838ab36d8781d9ec295a62380936 Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2015-10-16 05:00:17 -07:00
Catherine Sullivan	48becae60f	i40e: Recognize 1000Base_T_Optical phy type when link is up 1000Base_T_Optical got added to the function that figures out what is supported when link is down but not when link is up. Add it in there too so that we display the correct information. Change-ID: I85ebcdfa7c02d898c44c673b1500552a53c8042e Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2015-10-16 04:57:54 -07:00
Mitch Williams	cc7e406cb9	i40evf: correctly populate vlan_features The vlan_features field was correctly being set to the same value as the netdev features field. However, this was being done before the features were actually being set up, leaving the vlan_features empty. Also, after a reset, vlan_features will be incorrectly assigned the previous netdev feature flags, which can contain VLAN feature bits. This makes the VLAN code angry and will cause a stack dump. To fix these issues, set up the netdev features first, then mask out the VLAN feature bits when assigning vlan_features. Change-ID: Ib0548869dc83cf6a841cb8697dd94c12359ba4d2 Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2015-10-16 04:55:29 -07:00
Jingjing Wu	5d38c93e71	i40e: reset the invalid msg counter in vf when a valid msg is received When the number of invalid messages from a VF is exceeded, the VF will be disabled, due to the invalid messages. This happens if other VF drivers (like DPDK) send a message through the driver's mailbox (aka virtchannel) interface, but the message is not supported by the i40e pf driver, such as CONFIG_PROMISCUOUS_MODE. This patch changes the num_invalid_msgs in struct i40e_vf to record the continuous invalid msgs, and it will be reset when a valid msg is received. Change-ID: Iaec42fd3dcdd281476b3518be23261dd46fc3718 Signed-off-by: Jingjing Wu <jingjing.wu@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2015-10-16 04:53:04 -07:00
Jesse Brandeburg	ac26fc136c	i40e/i40evf: moderate interrupts differently The XL710 hardware has a different interrupt moderation design that can support a limit of total interrupts per second per vector, in addition to the "number of interrupts per second" controls already established in the driver. This combination of hardware features allows us to set very low default latency settings but minimize the total CPU utilization by not making too many interrupts, should the user desire. The current driver implementation is still enabling the dynamic moderation in the driver, and only using the rx/tx-usecs limit in ethtool to limit the interrupt rate per second, by default. The new code implemented in this patch 2) adds init/use of the new "Interrupt Limit" register 3) adds ethtool knob to control/report the limits above Usage is ethtool -C ethx rx-usecs-high <value> Where <value> is number of microseconds to create a rate of 1/N interrupts per second, regardless of rx-usecs or tx-usecs values. Since there is a credit based scheme in the hardware, the rx-usecs and tx-usecs can be configured for very low latency for short bursts, but once the credit runs out the refill rate on the credits is limited by rx-usecs-high. Change-ID: I3a1075d3296123b0f4f50623c779b027af5b188d Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2015-10-16 04:50:38 -07:00
Greg Bowers	947570e800	i40e: Add support for non-willing Apps Adds support for setting a new bit in the Set Local LLDP MIB AQ command Type field. When set to 1, the bit indicates to FW that Apps should be treated as non-willing. When 0, FW behaves as before. Change-ID: I0d2101c1606c59c7188d3e6a0c7810e0f205233a Signed-off-by: Greg Bowers <gregory.j.bowers@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2015-10-16 04:48:11 -07:00
Shannon Nelson	1cdfd88f2d	i40e: priv flag for controlling VEB stats Add an ethtool priv flag to enable and disable printing the VEB statistics. Change-ID: I7654054a3a73b08aa8310d94ee8fce6219107dd8 Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2015-10-16 04:45:47 -07:00
Greg Rose	d9d17cf74a	i40e: Removed unused defines Two defines that are not used are causing customer confusion - remove them. Change-ID: Icef0325aca8e0f4fcdfc519e026bdd375e791200 Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2015-10-16 04:43:25 -07:00
Shannon Nelson	3c5c420535	i40e: remove read/write failed messages from nvmupdate Allow the nvmupdate application to decide when a read or write error should be exposed to the user. Since the application needs to use write probes to find the ReadOnly sections on a potentially unknown NVM version in the HW and read probes to check the status of the last write, some error messages are expected, but need not be shown to the users. The driver doesn't know which are ignorable from real errors, so needs to let the application make the decision. Change-ID: I78fca8ab672bede11c10c820b83c26adfd536d03 Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2015-10-16 04:41:00 -07:00
Jingjing Wu	4e68adfeb9	i40e/i40evf: Fix compile issue related to const string Add const to functions that return strings that aren't going to be modified. This addresses some reported compile complaints. Change-ID: Ic56b1e814ab4d23a50480e7fdec652445f776ee8 Signed-off-by: Jingjing Wu <jingjing.wu@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2015-10-16 04:38:35 -07:00
Shannon Nelson	6dec101765	i40e: generate fewer startup messages Cut down on the number of startup log entries by putting a couple behind debug flags and combining a couple others into a single line. Change-ID: I708089f086308f84d43f8b6f0e8a634a02d058fb Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2015-10-16 04:36:13 -07:00
Jesse Brandeburg	32b3e08fff	drivers/net/intel: use napi_complete_done() As per Eric Dumazet's previous patches: (see commit (`24d2e4a507`) - tg3: use napi_complete_done()) Quoting verbatim: Using napi_complete_done() instead of napi_complete() allows us to use /sys/class/net/ethX/gro_flush_timeout GRO layer can aggregate more packets if the flush is delayed a bit, without having to set too big coalescing parameters that impact latencies. </end quote> Tested configuration: low latency via ethtool -C ethx adaptive-rx off rx-usecs 10 adaptive-tx off tx-usecs 15 workload: streaming rx using netperf TCP_MAERTS igb: MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.1 () port 0 AF_INET : demo ... Interim result: 941.48 10^6bits/s over 1.000 seconds ending at 1440193171.589 Alignment Offset Bytes Bytes Recvs Bytes Sends Local Remote Local Remote Xfered Per Per Recv Send Recv Send Recv (avg) Send (avg) 8 8 0 0 1176930056 1475.36 797726 16384.00 71905 MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.1 () port 0 AF_INET : demo ... Interim result: 941.49 10^6bits/s over 0.997 seconds ending at 1440193142.763 Alignment Offset Bytes Bytes Recvs Bytes Sends Local Remote Local Remote Xfered Per Per Recv Send Recv Send Recv (avg) Send (avg) 8 8 0 0 1175182320 50476.00 23282 16384.00 71816 i40e: Hard to test because the traffic is incoming so fast (24Gb/s) that GRO always receives 87kB, even at the highest interrupt rate. Other drivers were only compile tested. Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2015-10-16 04:33:46 -07:00
Alexander Duyck	7709b4c1ff	i40evf: Add support for netpoll Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2015-10-16 04:31:20 -07:00
Alexander Duyck	8b65035905	i40e/i40evf: Drop useless "IN_NETPOLL" flag The code in i40e and i40evf is using an "IN_NETPOLL" flag that has never added any value due to the fact that the Rx clean-up is handled in NAPI. As such the flag was set, the queue was scheduled via NAPI, and then polled from the netpoll controller and if any Rx packets were processed the were processed in the wrong context. In addition the flag itself just added an unneeded conditional to the hot-path so it can safely be dropped and save us a few instructions. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2015-10-16 04:28:57 -07:00
Alexander Duyck	c67caceb86	i40e/i40evf: Fix handling of napi budget The polling routine for i40e was rounding up the budget for Rx cleanup to 1. This is incorrect as the netpoll poll call is expecting no Rx to be processed as the budget passed was 0. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	2015-10-16 04:26:33 -07:00
David Ahern	51161aa98d	net: Fix suspicious RCU usage in fib_rebalance This command: ip route add 192.168.1.0/24 nexthop via 10.2.1.5 dev eth1 nexthop via 10.2.2.5 dev eth2 generated this suspicious RCU usage message: [ 63.249262] [ 63.249939] =============================== [ 63.251571] [ INFO: suspicious RCU usage. ] [ 63.253250] 4.3.0-rc3+ #298 Not tainted [ 63.254724] ------------------------------- [ 63.256401] ../include/linux/inetdevice.h:205 suspicious rcu_dereference_check() usage! [ 63.259450] [ 63.259450] other info that might help us debug this: [ 63.259450] [ 63.262297] [ 63.262297] rcu_scheduler_active = 1, debug_locks = 1 [ 63.264647] 1 lock held by ip/2870: [ 63.265896] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff813ebfb7>] rtnl_lock+0x12/0x14 [ 63.268858] [ 63.268858] stack backtrace: [ 63.270409] CPU: 4 PID: 2870 Comm: ip Not tainted 4.3.0-rc3+ #298 [ 63.272478] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 [ 63.275745] 0000000000000001 ffff8800b8c9f8b8 ffffffff8125f73c ffff88013afcf301 [ 63.278185] ffff8800bab7a380 ffff8800b8c9f8e8 ffffffff8107bf30 ffff8800bb728000 [ 63.280634] ffff880139fe9a60 0000000000000000 ffff880139fe9a00 ffff8800b8c9f908 [ 63.283177] Call Trace: [ 63.283959] [<ffffffff8125f73c>] dump_stack+0x4c/0x68 [ 63.285593] [<ffffffff8107bf30>] lockdep_rcu_suspicious+0xfa/0x103 [ 63.287500] [<ffffffff8144d752>] __in_dev_get_rcu+0x48/0x4f [ 63.289169] [<ffffffff8144d797>] fib_rebalance+0x3e/0x127 [ 63.290753] [<ffffffff8144d986>] ? rcu_read_unlock+0x3e/0x5f [ 63.292442] [<ffffffff8144ea45>] fib_create_info+0xaf9/0xdcc [ 63.294093] [<ffffffff8106c12f>] ? sched_clock_local+0x12/0x75 [ 63.295791] [<ffffffff8145236a>] fib_table_insert+0x8c/0x451 [ 63.297493] [<ffffffff8144bf9c>] ? fib_get_table+0x36/0x43 [ 63.299109] [<ffffffff8144c3ca>] inet_rtm_newroute+0x43/0x51 [ 63.300709] [<ffffffff813ef684>] rtnetlink_rcv_msg+0x182/0x195 [ 63.302334] [<ffffffff8107d04c>] ? trace_hardirqs_on+0xd/0xf [ 63.303888] [<ffffffff813ebfb7>] ? rtnl_lock+0x12/0x14 [ 63.305346] [<ffffffff813ef502>] ? __rtnl_unlock+0x12/0x12 [ 63.306878] [<ffffffff81407c4c>] netlink_rcv_skb+0x3d/0x90 [ 63.308437] [<ffffffff813ec00e>] rtnetlink_rcv+0x21/0x28 [ 63.309916] [<ffffffff81407742>] netlink_unicast+0xfa/0x17f [ 63.311447] [<ffffffff81407a5e>] netlink_sendmsg+0x297/0x2dc [ 63.313029] [<ffffffff813c6cd4>] sock_sendmsg_nosec+0x12/0x1d [ 63.314597] [<ffffffff813c835b>] ___sys_sendmsg+0x196/0x21b [ 63.316125] [<ffffffff8100bf9f>] ? native_sched_clock+0x1f/0x3c [ 63.317671] [<ffffffff8106c12f>] ? sched_clock_local+0x12/0x75 [ 63.319185] [<ffffffff8106c397>] ? sched_clock_cpu+0x9d/0xb6 [ 63.320693] [<ffffffff8107e2d7>] ? __lock_is_held+0x32/0x54 [ 63.322145] [<ffffffff81159fcb>] ? __fget_light+0x4b/0x77 [ 63.323541] [<ffffffff813c8726>] __sys_sendmsg+0x3d/0x5b [ 63.324947] [<ffffffff813c8751>] SyS_sendmsg+0xd/0x19 [ 63.326274] [<ffffffff814c8f57>] entry_SYSCALL_64_fastpath+0x12/0x6f It looks like all of the code paths to fib_rebalance are under rtnl. Fixes: `0e884c78ee` ("ipv4: L3 hash-based multipath") Cc: Peter Nørlund <pch@ordbogen.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-16 00:57:55 -07:00
Tom Herbert	ac00737f4e	bpf: Need to call bpf_prog_uncharge_memlock from bpf_prog_put Currently, is only called from __prog_put_rcu in the bpf_prog_release path. Need this to call this from bpf_prog_put also to get correct accounting. Fixes: `aaac3ba95e` ("bpf: charge user for creation of BPF maps and programs") Signed-off-by: Tom Herbert <tom@herbertland.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-16 00:55:02 -07:00
David S. Miller	a302afe980	Merge branch 'robust_listener' Eric Dumazet says: ==================== tcp/dccp: make our listener code more robust This patch series addresses request sockets leaks and listener dismantle phase. This survives a stress test with listeners being added/removed quite randomly. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-16 00:52:27 -07:00
Eric Dumazet	ebb516af60	tcp/dccp: fix race at listener dismantle phase Under stress, a close() on a listener can trigger the WARN_ON(sk->sk_ack_backlog) in inet_csk_listen_stop() We need to test if listener is still active before queueing a child in inet_csk_reqsk_queue_add() Create a common inet_child_forget() helper, and use it from inet_csk_reqsk_queue_add() and inet_csk_listen_stop() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-16 00:52:19 -07:00
Eric Dumazet	f03f2e154f	tcp/dccp: add inet_csk_reqsk_queue_drop_and_put() helper Let's reduce the confusion about inet_csk_reqsk_queue_drop() : In many cases we also need to release reference on request socket, so add a helper to do this, reducing code size and complexity. Fixes: `4bdc3d6614` ("tcp/dccp: fix behavior of stale SYN_RECV request sockets") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-16 00:52:18 -07:00
Eric Dumazet	ef84d8ce5a	Revert "inet: fix double request socket freeing" This reverts commit `c69736696c`. At the time of above commit, tcp_req_err() and dccp_req_err() were dead code, as SYN_RECV request sockets were not yet in ehash table. Real bug was fixed later in a different commit. We need to revert to not leak a refcount on request socket. inet_csk_reqsk_queue_drop_and_put() will be added in following commit to make clean inet_csk_reqsk_queue_drop() does not release the reference owned by caller. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-16 00:52:17 -07:00
Ivan Vecera	47ea032533	drivers/net: get rid of unnecessary initializations in .get_drvinfo() Many drivers initialize uselessly n_priv_flags, n_stats, testinfo_len, eedump_len & regdump_len fields in their .get_drvinfo() ethtool op. It's not necessary as these fields is filled in ethtool_get_drvinfo(). v2: removed unused variable v3: removed another unused variable Signed-off-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-16 00:24:10 -07:00
David S. Miller	ae23051820	Merge branch 'tipc-link-improvements' Jon Maloy says: ==================== tipc: some link level code improvements Extensive testing has revealed some weaknesses and non-optimal solutions in the link level code. This commit series addresses those issues. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-15 23:55:33 -07:00
Jon Paul Maloy	c819930090	tipc: update node FSM when peer RESET message is received The change made in the previous commit revealed a small flaw in the way the node FSM is updated. When the function tipc_node_link_down() is called for the last link to a node, we should check whether this was caused by a local reset or by a received RESET message from the peer. In the latter case, we can directly issue a PEER_LOST_CONTACT_EVT to the node FSM, so that it is ready to re-establish contact. If this is not done, the peer node will sometimes have to go through a second establish cycle before the link becomes stable. We fix this in this commit by conditionally issuing the mentioned event in the function tipc_node_link_down(). We also move LINK_RESET FSM even away from the link_reset() function and into the caller function, partially because it is easier to follow the code when state changes are gathered at a limited number of locations, partially because there will be cases in future commits where we don't want the link to go RESET mode when link_reset() is called. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-15 23:55:23 -07:00
Jon Paul Maloy	282b3a0562	tipc: send out RESET immediately when link goes down When a link is taken down because of a node local event, such as disabling of a bearer or an interface, we currently leave it to the peer node to discover the broken communication. The default time for such failure discovery is 1.5-2 seconds. If we instead allow the terminating link endpoint to send out a RESET message at the moment it is reset, we can achieve the impression that both endpoints are going down instantly. Since this is a very common scenario, we find it worthwhile to make this small modification. Apart from letting the link produce the said message, we also have to ensure that the interface is able to transmit it before TIPC is detached. We do this by performing the disabling of a bearer in three steps: 1) Disable reception of TIPC packets from the interface in question. 2) Take down the links, while allowing them so send out a RESET message. 3) Disable transmission of TIPC packets on the interface. Apart from this, we now have to react on the NETDEV_GOING_DOWN event, instead of as currently the NEDEV_DOWN event, to ensure that such transmission is possible during the teardown phase. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-15 23:55:22 -07:00
Jon Paul Maloy	73f646cec3	tipc: delay ESTABLISH state event when link is established Link establishing, just like link teardown, is a non-atomic action, in the sense that discovering that conditions are right to establish a link, and the actual adding of the link to one of the node's send slots is done in two different lock contexts. The link FSM is designed to help bridging the gap between the two contexts in a safe manner. We have now discovered a weakness in the implementaton of this FSM. Because we directly let the link go from state LINK_ESTABLISHING to state LINK_ESTABLISHED already in the first lock context, we are unable to distinguish between a fully established link, i.e., a link that has been added to its slot, and a link that has not yet reached the second lock context. It may hence happen that a manual intervention, e.g., when disabling an interface, causes the function tipc_node_link_down() to try removing the link from the node slots, decrementing its active link counter etc, although the link was never added there in the first place. We solve this by delaying the actual state change until we reach the second lock context, inside the function tipc_node_link_up(). This makes it possible for potentail callers of __tipc_node_link_down() to know if they should proceed or not, and the problem is solved. Unforunately, the situation described above also has a second problem. Since there by necessity is a tipc_node_link_up() call pending once the node lock has been released, we must defuse that call by setting the link back from LINK_ESTABLISHING to LINK_RESET state. This forces us to make a slight modification to the link FSM, which will now look as follows. +------------------------------------+ \|RESET_EVT \| \| \| \| +--------------+ \| +-----------------\| SYNCHING \|-----------------+ \| \|FAILURE_EVT +--------------+ PEER_RESET_EVT\| \| \| A \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|SYNCH_ \|SYNCH_ \| \| \| \|BEGIN_EVT \|END_EVT \| \| \| \| \| \| \| V \| V V \| +-------------+ +--------------+ +------------+ \| \| RESETTING \|<---------\| ESTABLISHED \|--------->\| PEER_RESET \| \| +-------------+ FAILURE_ +--------------+ PEER_ +------------+ \| \| EVT \| A RESET_EVT \| \| \| \| \| \| \| \| +----------------+ \| \| \| RESET_EVT\| \|RESET_EVT \| \| \| \| \| \| \| \| \| \| \|ESTABLISH_EVT \| \| \| \| +-------------+ \| \| \| \| \| \| RESET_EVT \| \| \| \| \| \| \| \| \| \| \| V V V \| \| \| \| +-------------+ +--------------+ RESET_EVT\| +--->\| RESET \|--------->\| ESTABLISHING \|<----------------+ +-------------+ PEER_ +--------------+ \| A RESET_EVT \| \| \| \| \| \| \| \|FAILOVER_ \|FAILOVER_ \|FAILOVER_ \|BEGIN_EVT \|END_EVT \|BEGIN_EVT \| \| \| V \| \| +-------------+ \| \| FAILINGOVER \|<----------------+ +-------------+ Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-15 23:55:21 -07:00
Jon Paul Maloy	8306f99a51	tipc: disallow packet duplicates in link deferred queue After the previous commits, we are guaranteed that no packets of type LINK_PROTOCOL or with illegal sequence numbers will be attempted added to the link deferred queue. This makes it possible to make some simplifications to the sorting algorithm in the function tipc_skb_queue_sorted(). We also alter the function so that it will drop packets if one with the same seqeunce number is already present in the queue. This is necessary because we have identified weird packet sequences, involving duplicate packets, where a legitimate in-sequence packet may advance to the head of the queue without being detected and de-queued. Finally, we make this function outline, since it will now be called only in exceptional cases. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-15 23:55:21 -07:00
Jon Paul Maloy	81204c492b	tipc: improve sequence number checking The sequence number of an incoming packet is currently only checked for less than, equality to, or bigger than the next expected number, meaning that the receive window in practice becomes one half sequence number cycle, or U16_MAX/2. This does not make sense, and may not even be safe if there are extreme delays in the network. Any packet sent by the peer during the ongoing cycle must belong inside his current send window, or should otherwise be dropped if possible. Since a link endpoint cannot know its peer's current send window, it has to base this sanity check on a worst-case assumption, i.e., that the peer is using a maximum sized window of 8191 packets. Using this assumption, we now add a check that the sequence number is not bigger than next_expected + TIPC_MAX_LINK_WIN. We also re-order the checks done, so that the receive window test is performed before the gap test. This way, we are guaranteed that no packet with illegal sequence numbers are ever added to the deferred queue. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-15 23:55:20 -07:00
Jon Paul Maloy	f9aa358a81	tipc: simplify tipc_link_rcv() reception loop Currently, all packets received in tipc_link_rcv() are unconditionally added to the packet deferred queue, whereafter that queue is walked and all its buffers evaluated for delivery. This is both non-optimal and and makes the queue sorting function unnecessary complex. This commit changes the loop so that an arrived packet is evaluated first, and added to the deferred queue only when a sequence number gap is discovered. A non-empty deferred queue is walked until it is empty or until its head's sequence number doesn't fit. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-15 23:55:19 -07:00
Jon Paul Maloy	9945e8043e	tipc: limit usage of temporary skb list during packet reception During packet reception, the function tipc_link_rcv() adds its accepted packets to a temporary buffer queue, before finally splicing this queue into the lock protected input queue that will be delivered up to the socket layer. The purpose is to reduce potential contention on the input queue lock. However, since the vast majority of packets arrive in sequence, they will anyway be added one by one to the input queue, and the use of the temporary queue becomes a sub-optimization. The only case where this queue makes sense is when unpacking buffers from a bundle packet; here we want to avoid dozens of small buffers to be added individually to the lock-protected input queue in a tight loop. In this commit, we remove the general usage of the temporary queue, and keep it only for the packet unbundling case. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-10-15 23:55:18 -07:00

1 2 3 4 5 ...

548675 Commits All Branches Search

548675 Commits

All Branches