Commit Graph

38979 Commits

Author SHA1 Message Date
Linus Lüssing 8a4023c5b5 batman-adv: Fix potential synchronization issues in mcast tvlv handler
So far the mcast tvlv handler did not anticipate the processing of
multiple incoming OGMs from the same originator at the same time. This
can lead to various issues:

* Broken refcounting: For instance two mcast handlers might both assume
  that an originator just got multicast capabilities and will together
  wrongly decrease mcast.num_disabled by two, potentially leading to
  an integer underflow.

* Potential kernel panic on hlist_del_rcu(): Two mcast handlers might
  one after another try to do an
  hlist_del_rcu(&orig->mcast_want_all_*_node). The second one will
  cause memory corruption / crashes.
  (Reported by: Sven Eckelmann <sven@narfation.org>)

Right in the beginning the code path makes assumptions about the current
multicast related state of an originator and bases all updates on that. The
easiest and least error prune way to fix the issues in this case is to
serialize multiple mcast handler invocations with a spinlock.

Fixes: 60432d756c ("batman-adv: Announce new capability via multicast TVLV")
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-14 22:52:08 +02:00
Linus Lüssing 9c936e3f4c batman-adv: Make MCAST capability changes atomic
Bitwise OR/AND assignments in C aren't guaranteed to be atomic. One
OGM handler might undo the set/clear of a specific bit from another
handler run in between.

Fix this by using the atomic set_bit()/clear_bit()/test_bit() functions.

Fixes: 60432d756c ("batman-adv: Announce new capability via multicast TVLV")
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-14 22:51:40 +02:00
Linus Lüssing ac4eebd484 batman-adv: Make TT capability changes atomic
Bitwise OR/AND assignments in C aren't guaranteed to be atomic. One
OGM handler might undo the set/clear of a specific bit from another
handler run in between.

Fix this by using the atomic set_bit()/clear_bit()/test_bit() functions.

Fixes: e17931d1a6 ("batman-adv: introduce capability initialization bitfield")
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-14 22:50:43 +02:00
Linus Lüssing 4635469f5c batman-adv: Make NC capability changes atomic
Bitwise OR/AND assignments in C aren't guaranteed to be atomic. One
OGM handler might undo the set/clear of a specific bit from another
handler run in between.

Fix this by using the atomic set_bit()/clear_bit()/test_bit() functions.

Fixes: 3f4841ffb3 ("batman-adv: tvlv - add network coding container")
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-14 22:50:43 +02:00
Linus Lüssing 65d7d46050 batman-adv: Make DAT capability changes atomic
Bitwise OR/AND assignments in C aren't guaranteed to be atomic. One
OGM handler might undo the set/clear of a specific bit from another
handler run in between.

Fix this by using the atomic set_bit()/clear_bit()/test_bit() functions.

Fixes: 17cf0ea455 ("batman-adv: tvlv - add distributed arp table container")
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-14 22:50:42 +02:00
Johannes Berg 40d9a38ad3 mac80211: use DECLARE_EWMA
Instead of using the out-of-line average calculation, use the new
DECLARE_EWMA() macro to declare a signal EWMA, and use that.

This actually *reduces* the code size slightly (on x86-64) while
also reducing the station info size by 80 bytes.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-08-14 17:49:53 +02:00
Lorenzo Bianconi b119ad6e72 mac80211: add rate mask logic for vht rates
Define rc_rateidx_vht_mcs_mask array and rate_idx_match_vht_mcs_mask()
method in order to apply mcs mask for vht rates

Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi83@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-08-14 17:49:51 +02:00
Lorenzo Bianconi e910867bd2 mac80211: define rate_control_apply_mask_ratetbl()
Define rate_control_apply_mask_ratetbl() in order to apply ratemask in
rate_control_set_rates() for station rate table

Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi83@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-08-14 17:49:51 +02:00
Lorenzo Bianconi 90c66bd223 mac80211: remove ieee80211_tx_rate dependency in rate mask code
Remove ieee80211_tx_rate dependency in rate_idx_match_legacy_mask(),
rate_idx_match_mcs_mask() and rate_idx_match_mask() in order to use the
previous logic to define a ratemask in rate_control_set_rates() for
station rate table. Moreover move rate mask definition logic in
rate_control_cap_mask()

Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi83@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-08-14 17:49:50 +02:00
Lorenzo Bianconi 35225eb7a5 mac80211: remove ieee80211_tx_info from rate_control_apply_mask signature
Remove unnecessary ieee80211_tx_info pointer from rate_control_apply_mask
signature. rate_control_apply_mask() will be used to define a ratemask in
rate_control_set_rates() for station rate table

Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi83@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-08-14 17:49:50 +02:00
Bertold Van den Bergh 4b819f6cc4 mac80211: Make OCB mode set BSSID
Perform the BSS_CHANGED_BSSID action when joining an OCB network.
This is required to set the broadcast BSSID in some network drivers.

Signed-off-by: Bertold Van den Bergh <bertold.vandenbergh@esat.kuleuven.be>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-08-14 17:49:49 +02:00
Bertold Van den Bergh cc11729893 mac80211: Only accept data frames in OCB mode
Currently OCB mode accepts frames with bssid==broadcast and type!=beacon.
Some non-data frames are sent matching this, for example probe responses.
This results in unnecessary creation of STA entries.

Signed-off-by: Bertold Van den Bergh <bertold.vandenbergh@esat.kuleuven.be>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-08-14 17:49:49 +02:00
Bertold Van den Bergh 5765f9f66e mac80211: Set txrc.bss to true for OCB interfaces
To make mac80211 accept the multicast rate requested by the user the
rate control should be told that it is operating in BSS mode.
Without this, the default rate is selected in rate_control_send_low
(!pubsta and !txrc->bss)

Signed-off-by: Bertold Van den Bergh <bertold.vandenbergh@esat.kuleuven.be>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-08-14 17:49:48 +02:00
Bertold Van den Bergh 876dc9308e nl80211: Allow setting multicast rate on OCB interfaces
Allow setting multicast rate on OCB interfaces.
Current behaviour results in EOPNOTSUPP when attempting this.

Signed-off-by: Bertold Van den Bergh <bertold.vandenbergh@esat.kuleuven.be>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-08-14 17:49:48 +02:00
Michal Kazior 9189ee31df cfg80211: propagate set_wiphy failure to userspace
If driver failed to setup wiphy params (e.g. rts
threshold, fragmentation treshold) userspace
wasn't properly notified about this. This could
lead to user confusion who would think the command
succeeded even if that wasn't the case.

Signed-off-by: Michal Kazior <michal.kazior@tieto.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-08-14 17:49:47 +02:00
Matthias May 4edd56981c cfg80211: regulatory: handle 5 and 10 MHz channels properly
The original assumption of 20MHz wide channels hasn't been true since
the addition of support for 5 and 10 MHz channels.
Change the code to no longer disable all channels that don't fit into
the 20MHz grid, but instead set the appropriate flags to disable
operation on specific bandwidths.

Signed-off-by: Matthias May <matthias.may@neratec.com>
[reword commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-08-14 17:48:46 +02:00
Eric Dumazet 83fccfc394 inet: fix potential deadlock in reqsk_queue_unlink()
When replacing del_timer() with del_timer_sync(), I introduced
a deadlock condition :

reqsk_queue_unlink() is called from inet_csk_reqsk_queue_drop()

inet_csk_reqsk_queue_drop() can be called from many contexts,
one being the timer handler itself (reqsk_timer_handler()).

In this case, del_timer_sync() loops forever.

Simple fix is to test if timer is pending.

Fixes: 2235f2ac75 ("inet: fix races with reqsk timers")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13 22:46:22 -07:00
David Ahern 9972f134a2 net: frags: Add VRF device index to cache and lookup
Fragmentation cache uses information from the IP header to reassemble
packets. That information can be duplicated across VRFs -- same source
and destination addresses, protocol and id. Handle fragmentation with
VRFs by adding the VRF device index to entries in the cache and the
lookup arg.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13 22:43:21 -07:00
David Ahern f7ba868b71 net: Use VRF index for oif in ip_send_unicast_reply
If output device is not specified use VRF device if input device is
enslaved. This is needed to ensure tcp acks and resets go out VRF device.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13 22:43:21 -07:00
David Ahern 3bfd847203 net: Use passed in table for nexthop lookups
If a user passes in a table for new routes use that table for nexthop
lookups. Specifically, this solves the case where a connected route does
not exist in the main table, but only another table and then a subsequent
route is added with a next hop using the connected route. ie.,

$ ip route ls
default via 10.0.2.2 dev eth0
10.0.2.0/24 dev eth0  proto kernel  scope link  src 10.0.2.15
169.254.0.0/16 dev eth0  scope link  metric 1003
192.168.56.0/24 dev eth1  proto kernel  scope link  src 192.168.56.51

$ ip route ls table 10
1.1.1.0/24 dev eth2  scope link

Without this patch adding a nexthop route fails:

$ ip route add table 10 2.2.2.0/24 via 1.1.1.10
RTNETLINK answers: Network is unreachable

With this patch the route is added successfully.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13 22:43:21 -07:00
David Ahern 021dd3b8a1 net: Add routes to the table associated with the device
When a device associated with a VRF is brought up or down routes
should be added to/removed from the table associated with the VRF.
fib_magic defaults to using the main or local tables. Have it use
the table with the device if there is one.

A part of this is directing prefsrc validations to the correct
table as well.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13 22:43:21 -07:00
David Ahern 30bbaa1950 net: Fix up inet_addr_type checks
Currently inet_addr_type and inet_dev_addr_type expect local addresses
to be in the local table. With the VRF device local routes for devices
associated with a VRF will be in the table associated with the VRF.
Provide an alternate inet_addr lookup to use a specific table rather
than defaulting to the local table.

inet_addr_type_dev_table keeps the same semantics as inet_addr_type but
if the passed in device is enslaved to a VRF then the table for that VRF
is used for the lookup.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13 22:43:21 -07:00
David Ahern 15be405eb2 net: Add inet_addr lookup by table
Currently inet_addr_type and inet_dev_addr_type expect local addresses
to be in the local table. With the VRF device local routes for devices
associated with a VRF will be in the table associated with the VRF.
Provide an alternate inet_addr lookup to use a specific table rather
than defaulting to the local table.

Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13 22:43:21 -07:00
David Ahern 9a24abfa42 udp: Handle VRF device in sendmsg
For unconnected UDP sockets using a VRF device lookup source address
based on VRF table. This allows the UDP header to be properly setup
before showing up at the VRF device via the dst.

Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13 22:43:20 -07:00
David Ahern 613d09b30f net: Use VRF device index for lookups on TX
As with ingress use the index of VRF master device for route lookups on
egress. However, the oif should only be used to direct the lookups to a
specific table. Routes in the table are not based on the VRF device but
rather interfaces that are part of the VRF so do not consider the oif for
lookups within the table. The FLOWI_FLAG_VRFSRC is used to control this
latter part.

Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13 22:43:20 -07:00
David Ahern cd2fbe1b6b net: Use VRF device index for lookups on RX
On ingress use index of VRF master device for route lookups if real device
is enslaved. Rules are expected to be installed for the VRF device to
direct lookups to a specific table.

Signed-off-by: Shrijeet Mukherjee <shm@cumulusnetworks.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13 22:43:20 -07:00
Andy Gospodarek 0344338bd8 net: addr IFLA_OPERSTATE to netlink message for ipv6 ifinfo
This is useful information to include in ipv6 netlink messages that
report interface information.  IFLA_OPERSTATE is already included in
ipv4 messages, but missing for ipv6.  This closes that gap.

Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13 22:35:30 -07:00
Sasha Levin da65ad1fe3 net: allow sleeping when modifying store_rps_map
Commit 10e4ea751 ("net: Fix race condition in store_rps_map") has moved the
manipulation of the rps_needed jump label under a spinlock. Since changing
the state of a jump label may sleep this is incorrect and causes warnings
during runtime.

Make rps_map_lock a mutex to allow sleeping under it.

Fixes: 10e4ea751 ("net: Fix race condition in store_rps_map")
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13 21:40:57 -07:00
Vivien Didelot 1114953615 net: dsa: add support for switchdev VLAN objects
Add new functions in DSA drivers to access hardware VLAN entries through
SWITCHDEV_OBJ_PORT_VLAN objects:

 - port_pvid_get() and vlan_getnext() to dump a VLAN
 - port_vlan_del() to exclude a port from a VLAN
 - port_pvid_set() and port_vlan_add() to join a port to a VLAN

The DSA infrastructure will ensure that each VLAN of the given range
does not already belong to another bridge. If it does, it will fallback
to software VLAN and won't program the hardware.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13 21:31:12 -07:00
Andy Gospodarek 35103d1117 net: ipv6 sysctl option to ignore routes when nexthop link is down
Like the ipv4 patch with a similar title, this adds a sysctl to allow
the user to change routing behavior based on whether or not the
interface associated with the nexthop was an up or down link.  The
default setting preserves the current behavior, but anyone that enables
it will notice that nexthops on down interfaces will no longer be
selected:

net.ipv6.conf.all.ignore_routes_with_linkdown = 0
net.ipv6.conf.default.ignore_routes_with_linkdown = 0
net.ipv6.conf.lo.ignore_routes_with_linkdown = 0
...

When the above sysctls are set, not only will link status be reported to
userspace, but an indication that a nexthop is dead and will not be used
is also reported.

1000::/8 via 7000::2 dev p7p1  metric 1024 dead linkdown  pref medium
1000::/8 via 8000::2 dev p8p1  metric 1024  pref medium
7000::/8 dev p7p1  proto kernel  metric 256 dead linkdown  pref medium
8000::/8 dev p8p1  proto kernel  metric 256  pref medium
9000::/8 via 8000::2 dev p8p1  metric 2048  pref medium
9000::/8 via 7000::2 dev p7p1  metric 1024 dead linkdown  pref medium
fe80::/64 dev p7p1  proto kernel  metric 256 dead linkdown  pref medium
fe80::/64 dev p8p1  proto kernel  metric 256  pref medium

This also adds devconf support and notification when sysctl values
change.

v2: drop use of rt6i_nhflags since it is not needed right now

Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13 21:27:19 -07:00
Andy Gospodarek cea45e208d net: track link status of ipv6 nexthops
Add support to track current link status of ipv6 nexthops to match
recent changes that added support for ipv4 nexthops.  This takes a
simple approach to track linkdown status for next-hops and simply
checks the dev for the dst entry and sets proper flags that to be used
in the netlink message.

v2: drop use of rt6i_nhflags since it is not needed right now

Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13 21:27:18 -07:00
Andy Whitcroft 25b97c016b ipv4: off-by-one in continuation handling in /proc/net/route
When generating /proc/net/route we emit a header followed by a line for
each route.  When a short read is performed we will restart this process
based on the open file descriptor.  When calculating the start point we
fail to take into account that the 0th entry is the header.  This leads
us to skip the first entry when doing a continuation read.

This can be easily seen with the comparison below:

  while read l; do echo "$l"; done </proc/net/route >A
  cat /proc/net/route >B
  diff -bu A B | grep '^[+-]'

On my example machine I have approximatly 10KB of route output.  There we
see the very first non-title element is lost in the while read case,
and an entry around the 8K mark in the cat case:

  +wlan0 00000000 02021EAC 0003 0 0 400 00000000 0 0 0
  -tun1  00C0AC0A 00000000 0001 0 0 950 00C0FFFF 0 0 0

Fix up the off-by-one when reaquiring position on continuation.

Fixes: 8be33e955c ("fib_trie: Fib walk rcu should take a tnode and key instead of a trie and a leaf")
BugLink: http://bugs.launchpad.net/bugs/1483440
Acked-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13 21:16:29 -07:00
Linus Lüssing a516993f0a net: fix wrong skb_get() usage / crash in IGMP/MLD parsing code
The recent refactoring of the IGMP and MLD parsing code into
ipv6_mc_check_mld() / ip_mc_check_igmp() introduced a potential crash /
BUG() invocation for bridges:

I wrongly assumed that skb_get() could be used as a simple reference
counter for an skb which is not the case. skb_get() bears additional
semantics, a user count. This leads to a BUG() invocation in
pskb_expand_head() / kernel panic if pskb_may_pull() is called on an skb
with a user count greater than one - unfortunately the refactoring did
just that.

Fixing this by removing the skb_get() call and changing the API: The
caller of ipv6_mc_check_mld() / ip_mc_check_igmp() now needs to
additionally check whether the returned skb_trimmed is a clone.

Fixes: 9afd85c9e4 ("net: Export IGMP/MLD message validation code")
Reported-by: Brenden Blanco <bblanco@plumgrid.com>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13 17:08:39 -07:00
Yuchung Cheng b340b26454 tcp: TLP retransmits last if failed to send new packet
When TLP fails to send new packet because of receive window
limit, it should fall back to retransmit the last packet instead.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13 16:52:20 -07:00
Yuchung Cheng fcd16c0a95 tcp: don't extend RTO on failed loss probe attempts
If TLP was unable to send a probe, it extended the RTO to
now + icsk_rto. But extending the RTO makes little sense
if no TLP probe went out. With this commit, instead of
extending the RTO we re-arm it relative to the transmit time
of the write queue head.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13 16:52:19 -07:00
David S. Miller 182ad468e7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/cavium/Kconfig

The cavium conflict was overlapping dependency
changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-13 16:23:11 -07:00
Kinglong Mee 129e5824cd sunrpc: Switch to using hash list instead single list
Switch using list_head for cache_head in cache_detail,
it is useful of remove an cache_head entry directly from cache_detail.

v8, using hash list, not head list

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-13 08:59:03 -04:00
Kinglong Mee c8c081b70c sunrpc/nfsd: Remove redundant code by exports seq_operations functions
Nfsd has implement a site of seq_operations functions as sunrpc's cache.
Just exports sunrpc's codes, and remove nfsd's redundant codes.

v8, same as v6

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-13 08:59:02 -04:00
Kinglong Mee 9936f2ae37 sunrpc: Store cache_detail in seq_file's private directly
Cleanup.

Just store cache_detail in seq_file's private,
an allocated handle is redundant.

v8, same as v6.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-13 08:59:01 -04:00
Adrien Schildknecht f5eeb5fa19 mac80211: fix invalid read in minstrel_sort_best_tp_rates()
At the last iteration of the loop, j may equal zero and thus
tp_list[j - 1] causes an invalid read.
Change the logic of the loop so that j - 1 is always >= 0.

Cc: stable@vger.kernel.org
Signed-off-by: Adrien Schildknecht <adrien+dev@schischi.me>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-08-13 13:52:34 +02:00
Dan Carpenter 75dbf00b44 mac80211: remove always true condition
The outside if statement checks that IEEE80211_TX_INTFL_MLME_CONN_TX is
set so this condition is always true.  Checking twice upsets the static
checkers.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-08-13 11:36:01 +02:00
Johannes Berg 4b58c37bb9 mac80211: remove ieee80211_aes_cmac_calculate_k1_k2()
The iwlwifi driver was the only driver that used this, but as
it turns out it never needed it, so we can remove it.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-08-13 11:31:45 +02:00
Geert Uytterhoeven 36890997b0 rfkill: Allow compile test of GPIO consumers if !GPIOLIB
The GPIO subsystem provides dummy GPIO consumer functions if GPIOLIB is
not enabled. Hence drivers that depend on GPIOLIB, but use GPIO consumer
functionality only, can still be compiled if GPIOLIB is not enabled.

Relax the dependency on GPIOLIB if COMPILE_TEST is enabled, where
appropriate.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-08-13 10:48:52 +02:00
Mugunthan V N 76550786c6 net: ipv4: increase dhcp inter device timeout
When a system has multiple ethernet devices and during DHCP
request (for using NFS), the system waits only for HZ/2 which is
500mS before switching to another interface for DHCP.

There are some routers (Ex: Trendnet routers) which responds to
DHCP request at about 560mS. When the system has only one
ethernet interface there is no issue as the timeout is 2S and the
dev xid doesn't changes and only retries.

But when the system has multiple Ethernet like DRA74x with CPSW
in dual EMAC mode, the DHCP response is dropped as the dev xid
changes while shifting to the next device. So changing inter
device timeout to HZ (which is 1S).

Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-12 16:40:22 -07:00
Florian Fainelli 211c504a44 net: dsa: Do not override PHY interface if already configured
In case we need to divert reads/writes using the slave MII bus, we may have
already fetched a valid PHY interface property from Device Tree, and that
mode is used by the PHY driver to make configuration decisions.

If we could not fetch the "phy-mode" property, we will assign p->phy_interface
to PHY_INTERFACE_MODE_NA, such that we can actually check for that condition as
to whether or not we should override the interface value.

Fixes: 19334920ea ("net: dsa: Set valid phy interface type")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-12 14:29:50 -07:00
Jeff Layton 24a9a9610c sunrpc: increase UNX_MAXNODENAME from 32 to __NEW_UTS_LEN bytes
The current limit of 32 bytes artificially limits the name string that
we end up stuffing into NFSv4.x client ID blobs. If you have multiple
hosts with long hostnames that only differ near the end, then this can
cause NFSv4 client ID collisions.

Linux nodenames are actually limited to __NEW_UTS_LEN bytes (64), so use
that as the limit instead. Also, use XDR_QUADLEN to specify the slack
length, just for clarity and in case someone in the future changes this
to something not evenly divisible by 4.

Reported-by: Michael Skralivetsky <michael.skralivetsky@primarydata.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-12 14:31:04 -04:00
David S. Miller 28eaad7504 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Johan Hedberg says:

====================
pull request: bluetooth 2015-08-11

Here's an important regression fix for the 4.2-rc series that ensures
user space isn't given invalid LTK values. The bug essentially prevents
the encryption of subsequent LE connections, i.e. makes it impossible to
pair devices over LE.

Let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-11 14:16:07 -07:00
Alexander Aring 4ae935c127 6lowpan: move module_init into core functionality
This patch moves module_init of 6lowpan module into core functionality
of 6lowpan module. To load the ipv6 module at probing of the 6lowpan
module should be core functionality. Loading next header compression
modules is iphc specific. Nevertheless we only support IPHC for the
generic 6LoWPAN branch right now so we can put it into the core
functionality. If possible new compression formats are introduced nhc
should load only when iphc is build.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-08-11 22:05:36 +02:00
Alexander Aring b72f6f51dc 6lowpan: add generic 6lowpan netdev private data
This patch introduced the 6lowpan netdev private data struct. We name it
lowpan_priv and it's placed at the beginning of netdev private data. All
lowpan interfaces should allocate this room at first of netdev private
data. 6LoWPAN LL private data can be allocate by additional netdev private
data, e.g. dev->priv_size should be "sizeof(struct lowpan_priv) +
sizeof(LL_LOWPAN_PRIVATE_DATA)".

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-08-11 22:05:36 +02:00
Alexander Aring a42bbba5af Bluetooth: 6lowpan: change netdev_priv to lowpan_dev
The usually way to get the btle lowpan private data is to use the
introduced lowpan_dev inline function. This patch will cleanup by using
lowpan_dev consequently.

Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-08-11 22:05:36 +02:00
Frederic Danis 9e6080936c net: rfkill: gpio: Remove BCM2E39 support
Power management support for BCM2E39 is now performed in Bluetooth
BCM UART driver.

Signed-off-by: Frederic Danis <frederic.danis@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-08-11 21:26:27 +02:00
Vivien Didelot ba14d9eb19 net: dsa: add support for switchdev FDB objects
Implement the switchdev_port_obj_{add,del,dump} functions in DSA to
support the SWITCHDEV_OBJ_PORT_FDB objects.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-11 12:03:19 -07:00
Vivien Didelot ce80e7bc57 net: switchdev: support static FDB addresses
This patch adds an ndm_state member to the switchdev_obj_fdb structure,
in order to support static FDB addresses.

Set Rocker ndm_state to NUD_REACHABLE.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-11 12:03:19 -07:00
Vivien Didelot 2a778e1b58 net: dsa: change FDB routines prototypes
Change the prototype of port_getnext to include a vid parameter.

This is necessary to introduce the support for VLAN.

Also rename the fdb_{add,del,getnext} function pointers to
port_fdb_{add,del,getnext} since they are specific to a given port.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-11 12:03:19 -07:00
David S. Miller cdf0969763 Revert "Merge branch 'mv88e6xxx-switchdev-fdb'"
This reverts commit f1d5ca4344, reversing
changes made to 4933d85c51.

I applied v2 instead of v3.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-11 12:00:37 -07:00
Ruben Wisniewski 571a963768 batman-adv: Avoid u32 overflow during gateway select
The gateway selection based on fast connections is using a single value
calculated from the average tq (0-255) and the download bandwidth (in
100Kibit). The formula for the first step (tq ** 2 * 10000 * bandwidth)
tends to overflow a u32 with low bandwidth settings like 50 [100KiBit]
and a tq value of over 92.

Changing this to a 64 bit unsigned integer allows to support a
bandwidth_down with up to ~2.8e10 [100KiBit] and a perfect tq of 255. This
is ~6.6 times higher than the maximum possible value of the gateway
announcement TVLV.

This problem only affects the non-default gw_sel_class 1.

Signed-off-by: Ruben Wisniewsi <ruben@vfn-nrw.de>
[sven@narfation.org: rewritten commit message]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-11 18:10:04 +02:00
Sven Eckelmann e071d93eb4 batman-adv: Replace gw_reselect divisor with simple shift
The gw_factor is divided by BATADV_TQ_LOCAL_WINDOW_SIZE ** 2 * 64. But the
rest of the calculation has nothing to do with the tq window size and
therefore the calculation is just (tmp_gw_factor / (64 ** 3)).

Replace it with a simple shift to avoid a costly 64-bit divide when the
max_gw_factor is changed from u32 to u64. This type change is necessary
to avoid an overflow bug.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-11 18:10:04 +02:00
David Ahern 42a7b32b73 xfrm: Add oif to dst lookups
Rules can be installed that direct route lookups to specific tables based
on oif. Plumb the oif through the xfrm lookups so it gets set in the flow
struct and passed to the resolver routines.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-08-11 12:41:35 +02:00
Andrzej Hajda df367561ff net/xfrm: use kmemdup rather than duplicating its implementation
The patch was generated using fixed coccinelle semantic patch
scripts/coccinelle/api/memdup.cocci [1].

[1]: http://permalink.gmane.org/gmane.linux.kernel/2014320

Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-08-11 12:41:35 +02:00
Thomas Egerer eae8dee992 xfrm6: Fix IPv6 ECN decapsulation
Using ipv6_get_dsfield on the outer IP header implies that inner and
outer header are of the the same address family. For interfamily
tunnels, particularly 646, the code reading the DSCP field obtains the
wrong values (IHL and the upper four bits of the DSCP field).
This can cause the code to detect a congestion encoutered state in the
outer header and enable the corresponding bits in the inner header, too.

Since the DSCP field is stored in the xfrm mode common buffer
independently from the IP version of the outer header, it's safe (and
correct) to take this value from there.

Signed-off-by: Thomas Egerer <thomas.egerer@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2015-08-11 12:41:34 +02:00
Daniel Borkmann 308ac9143e netfilter: nf_conntrack: push zone object into functions
This patch replaces the zone id which is pushed down into functions
with the actual zone object. It's a bigger one-time change, but
needed for later on extending zones with a direction parameter, and
thus decoupling this additional information from all call-sites.

No functional changes in this patch.

The default zone becomes a global const object, namely nf_ct_zone_dflt
and will be returned directly in various cases, one being, when there's
f.e. no zoning support.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-11 12:29:01 +02:00
Eric Dumazet 3257d8b12f inet: fix possible request socket leak
In commit b357a364c5 ("inet: fix possible panic in
reqsk_queue_unlink()"), I missed fact that tcp_check_req()
can return the listener socket in one case, and that we must
release the request socket refcount or we leak it.

Tested:

 Following packetdrill test template shows the issue

0     socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0    bind(3, ..., ...) = 0
+0    listen(3, 1) = 0

+0    < S 0:0(0) win 2920 <mss 1460,sackOK,nop,nop>
+0    > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.002 < . 1:1(0) ack 21 win 2920
+0    > R 21:21(0)

Fixes: b357a364c5 ("inet: fix possible panic in reqsk_queue_unlink()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-10 21:17:45 -07:00
Eric Dumazet 2235f2ac75 inet: fix races with reqsk timers
reqsk_queue_destroy() and reqsk_queue_unlink() should use
del_timer_sync() instead of del_timer() before calling reqsk_put(),
otherwise we could free a req still used by another cpu.

But before doing so, reqsk_queue_destroy() must release syn_wait_lock
spinlock or risk a dead lock, as reqsk_timer_handler() might
need to take this same spinlock from reqsk_queue_unlink() (called from
inet_csk_reqsk_queue_drop())

Fixes: fa76ce7328 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-10 21:17:29 -07:00
David S. Miller 182554570a Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains five Netfilter fixes for your net tree,
they are:

1) Silence a warning on falling back to vmalloc(). Since 88eab472ec, we can
   easily hit this warning message, that gets users confused. So let's get rid
   of it.

2) Recently when porting the template object allocation on top of kmalloc to
   fix the netns dependencies between x_tables and conntrack, the error
   checks where left unchanged. Remove IS_ERR() and check for NULL instead.
   Patch from Dan Carpenter.

3) Don't ignore gfp_flags in the new nf_ct_tmpl_alloc() function, from
   Joe Stringer.

4) Fix a crash due to NULL pointer dereference in ip6t_SYNPROXY, patch from
   Phil Sutter.

5) The sequence number of the Syn+ack that is sent from SYNPROXY to clients is
   not adjusted through our NAT infrastructure, as a result the client may
   ignore this TCP packet and TCP flow hangs until the client probes us.  Also
   from Phil Sutter.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-10 21:03:25 -07:00
Andrew Lunn 6bc6d0a881 dsa: Support multiple MDIO busses
When using a cluster of switches, some topologies will have an MDIO
bus per switch, not one for the whole cluster. Allow this to be
represented in the device tree, by adding an optional mii-bus property
at the switch level. The old platform_device method of instantiation
supports this already, so only the device tree binding needs extending
with an additional optional phandle.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-10 14:25:02 -07:00
Pravin B Shelar 9f57c67c37 gre: Remove support for sharing GRE protocol hook.
Support for sharing GREPROTO_CISCO port was added so that
OVS gre port and kernel GRE devices can co-exist. After
flow-based tunneling patches OVS GRE protocol processing
is completely moved to ip_gre module. so there is no need
for GRE protocol hook. Following patch consolidates
GRE protocol related functions into ip_gre module.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-10 14:03:54 -07:00
Pravin B Shelar b2acd1dc39 openvswitch: Use regular GRE net_device instead of vport
Using GRE tunnel meta data collection feature, we can implement
OVS GRE vport. This patch removes all of the OVS
specific GRE code and make OVS use a ip_gre net_device.
Minimal GRE vport is kept to handle compatibility with
current userspace application.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-10 14:03:54 -07:00
Pravin B Shelar 2e15ea390e ip_gre: Add support to collect tunnel metadata.
Following patch create new tunnel flag which enable
tunnel metadata collection on given device.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-10 14:03:54 -07:00
Pravin B Shelar a9020fde67 openvswitch: Move tunnel destroy function to oppenvswitch module.
This function will be used in gre and geneve vport implementations.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-10 14:03:54 -07:00
Rick Jones fb811395cd net: add explicit logging and stat for neighbour table overflow
Add an explicit neighbour table overflow message (ratelimited) and
statistic to make diagnosing neighbour table overflows tractable in
the wild.

Diagnosing a neighbour table overflow can be quite difficult in the wild
because there is no explicit dmesg logged.  Callers to neighbour code
seem to use net_dbg_ratelimit when the neighbour call fails which means
the "base message" is not emitted and the callback suppressed messages
from the ratelimiting can end-up juxtaposed with unrelated messages.
Further, a forced garbage collection will increment a stat on each call
whether it was successful in freeing-up a table entry or not, so that
statistic is only a hint.  So, add a net_info_ratelimited message and
explicit statistic to the neighbour code.

Signed-off-by: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-10 13:46:21 -07:00
Nikolay Aleksandrov a7854037da bridge: netlink: add support for vlan_filtering attribute
This patch adds the ability to toggle the vlan filtering support via
netlink. Since we're already running with rtnl in .changelink() we don't
need to take any additional locks.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-10 13:36:43 -07:00
Florian Westphal 330567b71d ipv6: don't reject link-local nexthop on other interface
48ed7b26fa ("ipv6: reject locally assigned nexthop addresses") is too
strict; it rejects following corner-case:

ip -6 route add default via fe80::1:2:3 dev eth1

[ where fe80::1:2:3 is assigned to a local interface, but not eth1 ]

Fix this by restricting search to given device if nh is linklocal.

Joint work with Hannes Frederic Sowa.

Fixes: 48ed7b26fa ("ipv6: reject locally assigned nexthop addresses")
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-10 13:29:22 -07:00
Jeff Layton 1b6dc1dffb nfsd/sunrpc: factor svc_rqst allocation and freeing from sv_nrthreads refcounting
In later patches, we'll want to be able to allocate and free svc_rqst
structures without monkeying with the serv->sv_nrthreads refcount.

Factor those pieces out of their respective functions.

Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-10 16:05:44 -04:00
Jeff Layton d70bc0c67c nfsd/sunrpc: move pool_mode definitions into svc.h
In later patches, we're going to need to allow code external to svc.c
to figure out what pool_mode is in use. Move these definitions into
svc.h to prepare for that.

Also, make the svc_pool_map object available and exported so that other
modules can peek in there to get insight into what pool mode is in use.
Likewise, export svc_pool_map_get/put function to make it safe to do so.

Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-10 16:05:43 -04:00
Jeff Layton b9e13cdfac nfsd/sunrpc: turn enqueueing a svc_xprt into a svc_serv operation
For now, all services use svc_xprt_do_enqueue, but once we add
workqueue-based service support, we'll need to do something different.

Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-10 16:05:42 -04:00
Jeff Layton 758f62fff9 nfsd/sunrpc: move sv_module parm into sv_ops
...not technically an operation, but it's more convenient and cleaner
to pass the module pointer in this struct.

Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-10 16:05:41 -04:00
Jeff Layton c369014f17 nfsd/sunrpc: move sv_function into sv_ops
Since we now have a container for holding svc_serv operations, move the
sv_function into it as well.

Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-10 16:05:41 -04:00
Jeff Layton ea126e7435 nfsd/sunrpc: add a new svc_serv_ops struct and move sv_shutdown into it
In later patches we'll need to abstract out more operations on a
per-service level, besides sv_shutdown and sv_function.

Declare a new svc_serv_ops struct to hold these operations, and move
sv_shutdown into this struct.

Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-10 16:05:40 -04:00
Chuck Lever cc9a903d91 svcrdma: Change maximum server payload back to RPCSVC_MAXPAYLOAD
Both commit 0380a3f375 ("svcrdma: Add a separate "max data segs"
macro for svcrdma") and commit 7e5be28827 ("svcrdma: advertise
the correct max payload") are incorrect. This commit reverts both
changes, restoring the server's maximum payload size to 1MB.

Commit 7e5be28827 based the server's maximum payload on the
_client's_ RPCRDMA_MAX_DATA_SEGS value. That was wrong.

Commit 0380a3f375 tried to fix this so that the client maximum
payload size could be raised without affecting the server, but
managed to confuse matters more on the server side.

More importantly, limiting the advertised maximum payload size was
meant to be a workaround, not the actual fix. We need to revisit

  https://bugzilla.linux-nfs.org/show_bug.cgi?id=270

A Linux client on a platform with 64KB pages can overrun and crash
an x86_64 NFS/RDMA server when the r/wsize is 1MB. An x86/64 Linux
client seems to work fine using 1MB reads and writes when the Linux
server's maximum payload size is restored to 1MB.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=270
Fixes: 0380a3f375 ("svcrdma: Add a separate "max data segs" macro")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-08-10 16:04:57 -04:00
Jakub Pawlowski fa14222077 Bluetooth: Enable new connection establishment procedure.
Currently, when trying to connect to already paired device that just
rotated its RPA MAC address, old address would be used and connection
would fail. In order to fix that, kernel must scan and receive
advertisement with fresh RPA before connecting.

This patch enables new connection establishment procedure. Instead of just
sending HCI_OP_LE_CREATE_CONN to controller, "connect" will add device to
kernel whitelist and start scan. If advertisement is received, it'll be
compared against whitelist and then trigger connection if it matches.
That fixes mentioned reconnect issue for  already paired devices. It also
make whole connection procedure more robust. We can try to connect to
multiple devices at same time now, even though controller allow only one.

Signed-off-by: Jakub Pawlowski <jpawlowski@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-08-10 21:36:13 +02:00
Jakub Pawlowski cc2b6911a2 Bluetooth: timeout handling in new connect procedure
Currently, when trying to connect to already paired device that just
rotated its RPA MAC address, old address would be used and connection
would fail. In order to fix that, kernel must scan and receive
advertisement with fresh RPA before connecting.

This patch makes sure that when new procedure is in use, and we're stuck
in scan phase because no advertisement was received and timeout happened,
or app decided to close socket, scan whitelist gets properly cleaned up.

Signed-off-by: Jakub Pawlowski <jpawlowski@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-08-10 21:36:13 +02:00
Jakub Pawlowski 28a667c9c2 Bluetooth: advertisement handling in new connect procedure
Currently, when trying to connect to already paired device that just
rotated its RPA MAC address, old address would be used and connection
would fail. In order to fix that, kernel must scan and receive
advertisement with fresh RPA before connecting.

This path makes sure that after advertisement is received from device that
we try to connect to, it is properly handled in check_pending_le_conn and
trigger connect attempt.

It also modifies hci_le_connect to make sure that connect attempt will be
properly continued.

Signed-off-by: Jakub Pawlowski <jpawlowski@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-08-10 21:36:13 +02:00
Jakub Pawlowski f75113a260 Bluetooth: add hci_connect_le_scan
Currently, when trying to connect to already paired device that just
rotated its RPA MAC address, old address would be used and connection
would fail. In order to fix that, kernel must scan and receive
advertisement with fresh RPA before connecting.

This patch adds hci_connect_le_scan with dependencies, new method that
will be used to connect to remote LE devices. Instead of just sending
connect request, it adds a device to whitelist. Later patches will make
use of this whitelist to send conenct request when advertisement is
received, and properly handle timeouts.

Signed-off-by: Jakub Pawlowski <jpawlowski@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-08-10 21:36:13 +02:00
Jakub Pawlowski e7d9ab731a Bluetooth: add hci_lookup_le_connect
This patch adds hci_lookup_le_connect method, that will be used to check
wether outgoing le connection attempt is in progress.

Signed-off-by: Jakub Pawlowski <jpawlowski@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-08-10 21:36:13 +02:00
Alexander Aring 8f8db91840 ieee802154: 6lowpan: fix error frag handling
This patch fixes the error handling for lowpan_xmit_fragment by replace
"-PTR_ERR" to "PTR_ERR". PTR_ERR returns already a negative errno code.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-08-10 20:43:07 +02:00
Alexander Aring c91208d819 ieee802154: add ack request default handling
This patch introduce a new mib entry which isn't part of 802.15.4 but
useful as default behaviour to set the ack request bit or not if we
don't know if the ack request bit should set. This is currently used for
stacks like IEEE 802.15.4 6LoWPAN.

Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-08-10 20:43:06 +02:00
Alexander Aring 89c7d788f8 mac802154: change frame_retries behaviour
This patch changes the default minimum value of frame_retries to 0 and
changes the frame_retries default value to 3 which is also 802.15.4
default.

We don't use the frame_retries "-1" value as indicator for no-aret mode
anymore, instead we checking on the ack request bit inside the 802.15.4
frame control field. This allows a acknowledge handling per frame. This
checking is done by transceiver or inside xmit callback of driver layer.

If a transceiver doesn't support ARET handling the transmit
functionality ignores ack frames then, which isn't well but should not
effect anything of current functionality.

Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-08-10 20:43:06 +02:00
Alexander Aring 91f02b3dd8 mac802154: cfg: remove test and set checks
This patch removes several checks if a value is really changed. This
makes only sense if we have another layer call e.g. calling the
driver_ops which is done by callbacks like "set_channel".

For MAC settings which need to be set by phy registers (if the phy
supports that handling) this is set by doing an interface up currently
and are not direct driver_ops calls, so we remove the checks from these
configuration callbacks.

Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Suggested-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-08-10 20:43:06 +02:00
Alexander Aring 09095fdc9e mac802154: fix wpan mac setting while lowpan is there
If we currently change the mac address inside the wpan interface while
we have a lowpan interface on top of the wpan interface, the mac address
setting doesn't reach the lowpan interface. The effect would be that the
IPv6 lowpan interface has the old SLAAC address and isn't working
anymore because the lowpan interface use in internal mechanism sometimes
dev->addr which is the old mac address of the wpan interface.

This patch checks if a wpan interface belongs to lowpan interface, if
yes then we need to check if the lowpan interface is down and change the
mac address also at the lowpan interface. When the lowpan interface will
be set up afterwards, it will use the correct SLAAC address which based
on the updated mac address setting.

Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Tested-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-08-10 20:43:06 +02:00
Alexander Aring 51e0e5d812 ieee802154: 6lowpan: remove multiple lowpan per wpan support
We currently supports multiple lowpan interfaces per wpan interface. I
never saw any use case into such functionality. We drop this feature now
because it's much easier do deal with address changes inside the under
laying wpan interface.

This patch removes the multiple lowpan interface and adds a lowpan_dev
netdev pointer into the wpan_dev, if this pointer isn't null the wpan
interface belongs to the assigned lowpan interface.

Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Tested-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-08-10 20:43:06 +02:00
Lukasz Duda 77e867b5f1 6lowpan: Fix extraction of flow label field
The lowpan_fetch_skb function is used to fetch the first byte,
which also increments the data pointer in skb structure,
making subsequent array lookup of byte 0 actually being byte 1.

To decompress the first byte of the Flow Label when the TF flag is
set to 0x01, the second half of the first byte is needed.

The patch fixes the extraction of the Flow Label field.

Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Lukasz Duda <lukasz.duda@nordicsemi.no>
Signed-off-by: Glenn Ruben Bakke <glenn.ruben.bakke@nordicsemi.no>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-08-10 20:43:05 +02:00
Dan Carpenter 0208bc8803 Bluetooth: Fix breakage in amp_write_rem_assoc_frag()
We should be passing the pointer itself instead of the address of the
pointer.

This was a copy and paste bug when we replaced the calls to
hci_send_cmd().  Originally, the arguments were "len, cp" but we
overwrote them with "sizeof(cp), &cp" by mistake.

Fixes: b3d3914006 ('Bluetooth: Move amp assoc read/write completed callback to amp.c')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-08-10 20:41:34 +02:00
Daniel Borkmann 4e7c133068 netlink: make sure -EBUSY won't escape from netlink_insert
Linus reports the following deadlock on rtnl_mutex; triggered only
once so far (extract):

[12236.694209] NetworkManager  D 0000000000013b80     0  1047      1 0x00000000
[12236.694218]  ffff88003f902640 0000000000000000 ffffffff815d15a9 0000000000000018
[12236.694224]  ffff880119538000 ffff88003f902640 ffffffff81a8ff84 00000000ffffffff
[12236.694230]  ffffffff81a8ff88 ffff880119c47f00 ffffffff815d133a ffffffff81a8ff80
[12236.694235] Call Trace:
[12236.694250]  [<ffffffff815d15a9>] ? schedule_preempt_disabled+0x9/0x10
[12236.694257]  [<ffffffff815d133a>] ? schedule+0x2a/0x70
[12236.694263]  [<ffffffff815d15a9>] ? schedule_preempt_disabled+0x9/0x10
[12236.694271]  [<ffffffff815d2c3f>] ? __mutex_lock_slowpath+0x7f/0xf0
[12236.694280]  [<ffffffff815d2cc6>] ? mutex_lock+0x16/0x30
[12236.694291]  [<ffffffff814f1f90>] ? rtnetlink_rcv+0x10/0x30
[12236.694299]  [<ffffffff8150ce3b>] ? netlink_unicast+0xfb/0x180
[12236.694309]  [<ffffffff814f5ad3>] ? rtnl_getlink+0x113/0x190
[12236.694319]  [<ffffffff814f202a>] ? rtnetlink_rcv_msg+0x7a/0x210
[12236.694331]  [<ffffffff8124565c>] ? sock_has_perm+0x5c/0x70
[12236.694339]  [<ffffffff814f1fb0>] ? rtnetlink_rcv+0x30/0x30
[12236.694346]  [<ffffffff8150d62c>] ? netlink_rcv_skb+0x9c/0xc0
[12236.694354]  [<ffffffff814f1f9f>] ? rtnetlink_rcv+0x1f/0x30
[12236.694360]  [<ffffffff8150ce3b>] ? netlink_unicast+0xfb/0x180
[12236.694367]  [<ffffffff8150d344>] ? netlink_sendmsg+0x484/0x5d0
[12236.694376]  [<ffffffff810a236f>] ? __wake_up+0x2f/0x50
[12236.694387]  [<ffffffff814cad23>] ? sock_sendmsg+0x33/0x40
[12236.694396]  [<ffffffff814cb05e>] ? ___sys_sendmsg+0x22e/0x240
[12236.694405]  [<ffffffff814cab75>] ? ___sys_recvmsg+0x135/0x1a0
[12236.694415]  [<ffffffff811a9d12>] ? eventfd_write+0x82/0x210
[12236.694423]  [<ffffffff811a0f9e>] ? fsnotify+0x32e/0x4c0
[12236.694429]  [<ffffffff8108cb70>] ? wake_up_q+0x60/0x60
[12236.694434]  [<ffffffff814cba09>] ? __sys_sendmsg+0x39/0x70
[12236.694440]  [<ffffffff815d4797>] ? entry_SYSCALL_64_fastpath+0x12/0x6a

It seems so far plausible that the recursive call into rtnetlink_rcv()
looks suspicious. One way, where this could trigger is that the senders
NETLINK_CB(skb).portid was wrongly 0 (which is rtnetlink socket), so
the rtnl_getlink() request's answer would be sent to the kernel instead
to the actual user process, thus grabbing rtnl_mutex() twice.

One theory would be that netlink_autobind() triggered via netlink_sendmsg()
internally overwrites the -EBUSY error to 0, but where it is wrongly
originating from __netlink_insert() instead. That would reset the
socket's portid to 0, which is then filled into NETLINK_CB(skb).portid
later on. As commit d470e3b483 ("[NETLINK]: Fix two socket hashing bugs.")
also puts it, -EBUSY should not be propagated from netlink_insert().

It looks like it's very unlikely to reproduce. We need to trigger the
rhashtable_insert_rehash() handler under a situation where rehashing
currently occurs (one /rare/ way would be to hit ht->elasticity limits
while not filled enough to expand the hashtable, but that would rather
require a specifically crafted bind() sequence with knowledge about
destination slots, seems unlikely). It probably makes sense to guard
__netlink_insert() in any case and remap that error. It was suggested
that EOVERFLOW might be better than an already overloaded ENOMEM.

Reference: http://thread.gmane.org/gmane.linux.network/372676
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-10 10:59:10 -07:00
Phil Sutter 3c16241c44 netfilter: SYNPROXY: fix sending window update to client
Upon receipt of SYNACK from the server, ipt_SYNPROXY first sends back an ACK to
finish the server handshake, then calls nf_ct_seqadj_init() to initiate
sequence number adjustment of forwarded packets to the client and finally sends
a window update to the client to unblock it's TX queue.

Since synproxy_send_client_ack() does not set synproxy_send_tcp()'s nfct
parameter, no sequence number adjustment happens and the client receives the
window update with incorrect sequence number. Depending on client TCP
implementation, this leads to a significant delay (until a window probe is
being sent).

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-10 13:55:07 +02:00
Phil Sutter 96fffb4f23 netfilter: ip6t_SYNPROXY: fix NULL pointer dereference
This happens when networking namespaces are enabled.

Suggested-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-10 13:54:44 +02:00
Masanari Iida ecea49914b net: ethernet: Fix double word "the the" in eth.c
This patch fix double word "the the" in
Documentation/DocBook/networking/API-eth-get-headlen.html
Documentation/DocBook/networking/netdev.html
Documentation/DocBook/networking.xml

These files are generated from comment in source,
so I have to fix comment in net/ethernet/eth.c.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-09 22:53:00 -07:00
Robert Shearman 118d523463 mpls: Enforce payload type of traffic sent using explicit NULL
RFC 4182 s2 states that if an IPv4 Explicit NULL label is the only
label on the stack, then after popping the resulting packet must be
treated as a IPv4 packet and forwarded based on the IPv4 header. The
same is true for IPv6 Explicit NULL with an IPv6 packet following.

Therefore, when installing the IPv4/IPv6 Explicit NULL label routes,
add an attribute that specifies the expected payload type for use at
forwarding time for determining the type of the encapsulated packet
instead of inspecting the first nibble of the packet.

Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-09 22:51:42 -07:00
Vivien Didelot 55045ddded net: dsa: add support for switchdev FDB objects
Remove the fdb_{add,del,getnext} function pointer in favor of new
port_fdb_{add,del,getnext}.

Implement the switchdev_port_obj_{add,del,dump} functions in DSA to
support the SWITCHDEV_OBJ_PORT_FDB objects.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-09 22:48:09 -07:00
Vivien Didelot 890248261a net: switchdev: support static FDB addresses
This patch adds a is_static boolean to the switchdev_obj_fdb structure,
in order to set the ndm_state to either NUD_NOARP or NUD_REACHABLE.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-09 22:48:09 -07:00
Vivien Didelot 1525c386a1 net: switchdev: change fdb addr for a byte array
The address in the switchdev_obj_fdb structure is currently represented
as a pointer. Replacing it for a 6-byte array allows switchdev to carry
addresses directly read from hardware registers, not stored by the
switch chip driver (as in Rocker).

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-09 22:48:08 -07:00
Masanari Iida 4933d85c51 net:wimax: Fix doucble word "the the" in networking.xml
This patch fix a double word "the the"
in Documentation/DocBook/networking.xml and
Documentation/DocBook/networking/API-Wimax-report-rfkill-sw.html.

These files are generated from comment in source, so I had to
fix the typo in net/wimax/io-rfkill.c

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-09 22:43:52 -07:00
Tom Herbert 10e4ea7514 net: Fix race condition in store_rps_map
There is a race condition in store_rps_map that allows jump label
count in rps_needed to go below zero. This can happen when
concurrently attempting to set and a clear map.

Scenario:

1. rps_needed count is zero
2. New map is assigned by setting thread, but rps_needed count _not_ yet
   incremented (rps_needed count still zero)
2. Map is cleared by second thread, old_map set to that just assigned
3. Second thread performs static_key_slow_dec, rps_needed count now goes
   negative

Fix is to increment or decrement rps_needed under the spinlock.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-07 15:56:56 -07:00
David S. Miller 95a428f3be Included changes:
- prevent DAT from replying on behalf of local clients and confuse L2
   bridges
 - fix crash on double list removal of TT objects (tt_local_entry)
 - fix crash due to missing NULL checks
 - initialize bw values for new GWs objects to prevent memory leak
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJVwdK+AAoJEOb/4TMchkvfO9UP/i8vIBZ0XabuYlG9r5GnYh8s
 1WvO7jbe4zYhe3WrfAibcfcxMT6v0aI6hc/VRCb36zTNT3VJeh76+fKGwLC4ZU08
 UdefhzCavV/gg2fX0b67aA/SsC6I9FrZAG8VOt5GXhMWqjZYSwbxMKLrNe/QRJks
 hUzfcQbmFjIMSTK2Zf4/MV0AHzzPKbHixE8t3to2DmlHhZ6lCWV5l42nv/H/s8js
 uxNI06rnQBY7nY06rZEualznEZpN7AFtFeIw9zXDZ0PgkIPrsjShwu/ZyzVK9wg1
 4ld1JQjf1uKbrOqvdY4bAx8EUUQQ3n68513IXuKLgAdPcBGYVFo8atbk9K24QPue
 KzO5tqg7ELf6EO1/haH/VLOTCrMxiC1C62CuNxQn1rDIUAIUrtX9Rqt2f2bS75KG
 tGNJTUfQfpNCY4IlqaUE8cdzABiY6cLIf68gSmXl/FfBzVNWDc1Totsi/B6C2WEM
 PYjXnmPc97tz0/iWRpvOMd1DxoMXnM0odH4gifUt3Osl0hKMbelhnGKACCvHM8jK
 3EOOz1yZIEoX9OkwtFnCji+web884ASUJjnyrNJgbyOY3juj2B0hSrWgK++trUMS
 +5oZMkFtHaar0l+JRc2Xhf/GqtkY50lGlwcPlOukDyKJ+RvPTRsShIYhns75Z2us
 dWqAlJEdf3kRZuDri8O3
 =JfaB
 -----END PGP SIGNATURE-----

Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge

Antonio Quartulli says:

====================
Included changes:
- prevent DAT from replying on behalf of local clients and confuse L2
  bridges
- fix crash on double list removal of TT objects (tt_local_entry)
- fix crash due to missing NULL checks
- initialize bw values for new GWs objects to prevent memory leak
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-07 15:51:10 -07:00
Wenyu Zhang e05176a328 openvswitch: Make 100 percents packets sampled when sampling rate is 1.
When sampling rate is 1, the sampling probability is UINT32_MAX. The packet
should be sampled even the prandom32() generate the number of UINT32_MAX.
And none packet need be sampled when the probability is 0.

Signed-off-by: Wenyu Zhang <wenyuz@vmware.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-07 12:00:38 -07:00
Alexei Starovoitov da8b43c0e1 vxlan: combine VXLAN_FLOWBASED into VXLAN_COLLECT_METADATA
IFLA_VXLAN_FLOWBASED is useless without IFLA_VXLAN_COLLECT_METADATA,
so combine them into single IFLA_VXLAN_COLLECT_METADATA flag.
'flowbased' doesn't convey real meaning of the vxlan tunnel mode.
This mode can be used by routing, tc+bpf and ovs.
Only ovs is strictly flow based, so 'collect metadata' is a better
name for this tunnel mode.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-07 11:46:34 -07:00
Sowmini Varadhan 467fa15356 RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.
Register pernet subsys init/stop functions that will set up
and tear down per-net RDS-TCP listen endpoints. Unregister
pernet subusys functions on 'modprobe -r' to clean up these
end points.

Enable keepalive on both accept and connect socket endpoints.
The keepalive timer expiration will ensure that client socket
endpoints will be removed as appropriate from the netns when
an interface is removed from a namespace.

Register a device notifier callback that will clean up all
sockets (and thus avoid the need to wait for keepalive timeout)
when the loopback device is unregistered from the netns indicating
that the netns is getting deleted.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-07 11:29:58 -07:00
Sowmini Varadhan d5a8ac28a7 RDS-TCP: Make RDS-TCP work correctly when it is set up in a netns other than init_net
Open the sockets calling sock_create_kern() with the correct struct net
pointer, and use that struct net pointer when verifying the
address passed to rds_bind().

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-07 11:29:57 -07:00
Andreas Schultz 3499abb249 netfilter: nfacct: per network namespace support
- Move the nfnl_acct_list into the network namespace, initialize
  and destroy it per namespace
- Keep track of refcnt on nfacct objects, the old logic does not
  longer work with a per namespace list
- Adjust xt_nfacct to pass the namespace when registring objects

Signed-off-by: Andreas Schultz <aschultz@tpip.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-07 11:50:56 +02:00
Pablo Neira Ayuso d2168e849e netfilter: nft_limit: add per-byte limiting
This patch adds a new NFTA_LIMIT_TYPE netlink attribute to indicate the type of
limiting.

Contrary to per-packet limiting, the cost is calculated from the packet path
since this depends on the packet length.

The burst attribute indicates the number of bytes in which the rate can be
exceeded.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-07 11:50:50 +02:00
Pablo Neira Ayuso 8bdf362642 netfilter: nft_limit: constant token cost per packet
The cost per packet can be calculated from the control plane path since this
doesn't ever change.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-07 11:49:50 +02:00
Pablo Neira Ayuso 3e87baafa4 netfilter: nft_limit: add burst parameter
This patch adds the burst parameter. This burst indicates the number of packets
that can exceed the limit.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-07 11:49:50 +02:00
Pablo Neira Ayuso f8d3a6bc76 netfilter: nft_limit: factor out shared code with per-byte limiting
This patch prepares the introduction of per-byte limiting.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-07 11:49:49 +02:00
Pablo Neira Ayuso dba27ec1bc netfilter: nft_limit: convert to token-based limiting at nanosecond granularity
Rework the limit expression to use a token-based limiting approach that refills
the bucket gradually. The tokens are calculated at nanosecond granularity
instead jiffies to improve precision.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-07 11:49:49 +02:00
Pablo Neira Ayuso 09e4e42a00 netfilter: nft_limit: rename to nft_limit_pkts
To prepare introduction of bytes ratelimit support.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-07 11:49:49 +02:00
Pablo Neira Ayuso d877f07112 netfilter: nf_tables: add nft_dup expression
This new expression uses the nf_dup engine to clone packets to a given gateway.
Unlike xt_TEE, we use an index to indicate output interface which should be
fine at this stage.

Moreover, change to the preemtion-safe this_cpu_read(nf_skb_duplicated) from
nf_dup_ipv{4,6} to silence a lockdep splat.

Based on the original tee expression from Arturo Borrero Gonzalez, although
this patch has diverted quite a bit from this initial effort due to the
change to support maps.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-07 11:49:49 +02:00
Pablo Neira Ayuso bbde9fc182 netfilter: factor out packet duplication for IPv4/IPv6
Extracted from the xtables TEE target. This creates two new modules for IPv4
and IPv6 that are shared between the TEE target and the new nf_tables dup
expressions.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-07 11:49:49 +02:00
Pablo Neira Ayuso 24b7811fa5 netfilter: xt_TEE: get rid of WITH_CONNTRACK definition
Use IS_ENABLED(CONFIG_NF_CONNTRACK) instead.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-07 11:49:48 +02:00
Pablo Neira Ayuso 0c45e76960 netfilter: nft_counter: convert it to use per-cpu counters
This patch converts the existing seqlock to per-cpu counters.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Suggested-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-07 11:49:48 +02:00
Nikolay Aleksandrov 786c2077ec bridge: netlink: account for the IFLA_BRPORT_PROXYARP_WIFI attribute size and policy
The attribute size wasn't accounted for in the get_slave_size() callback
(br_port_get_slave_size) when it was introduced, so fix it now. Also add
a policy entry for it in br_port_policy.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: 842a9ae08a ("bridge: Extend Proxy ARP design to allow optional rules for Wi-Fi")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-06 23:54:42 -07:00
Nikolay Aleksandrov 355b9f9df1 bridge: netlink: account for the IFLA_BRPORT_PROXYARP attribute size and policy
The attribute size wasn't accounted for in the get_slave_size() callback
(br_port_get_slave_size) when it was introduced, so fix it now. Also add
a policy entry for it in br_port_policy.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: 958501163d ("bridge: Add support for IEEE 802.11 Proxy ARP")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-06 23:54:42 -07:00
Oleg Nesterov 7ba8bd75dd net: pktgen: don't abuse current->state in pktgen_thread_worker()
Commit 1fbe4b46ca "net: pktgen: kill the Wait for kthread_stop
code in pktgen_thread_worker()" removed (in particular) the final
__set_current_state(TASK_RUNNING) and I didn't notice the previous
set_current_state(TASK_INTERRUPTIBLE). This triggers the warning
in __might_sleep() after return.

Afaics, we can simply remove both set_current_state()'s, and we
could do this a long ago right after ef87979c27 "pktgen: better
scheduler friendliness" which changed pktgen_thread_worker() to
use wait_event_interruptible_timeout().

Reported-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-06 23:52:44 -07:00
Roopa Prabhu 3dcb615e68 af_mpls: add null dev check in find_outdev
This patch adds null dev check for the 'cfg->rc_via_table ==
NEIGH_LINK_TABLE or dev_get_by_index() failed' case

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-06 22:03:58 -07:00
Dan Carpenter 5a9348b54d mpls: small cleanup in inet/inet6_fib_lookup_dev()
We recently changed this code from returning NULL to returning ERR_PTR.
There are some left over NULL assignments which we can remove.  We can
preserve the error code from ip_route_output() instead of always
returning -ENODEV.  Also these functions use a mix of gotos and direct
returns.  There is no cleanup necessary so I changed the gotos to
direct returns.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-06 21:56:55 -07:00
Herbert Xu a0a2a66024 net: Fix skb_set_peeked use-after-free bug
The commit 738ac1ebb9 ("net: Clone
skb before setting peeked flag") introduced a use-after-free bug
in skb_recv_datagram.  This is because skb_set_peeked may create
a new skb and free the existing one.  As it stands the caller will
continue to use the old freed skb.

This patch fixes it by making skb_set_peeked return the new skb
(or the old one if unchanged).

Fixes: 738ac1ebb9 ("net: Clone skb before setting peeked flag")
Reported-by: Brenden Blanco <bblanco@plumgrid.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Brenden Blanco <bblanco@plumgrid.com>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-06 21:55:47 -07:00
Jakub Pawlowski cb92205bad Bluetooth: fix MGMT_EV_NEW_LONG_TERM_KEY event
This patch fixes how MGMT_EV_NEW_LONG_TERM_KEY event is build. Right now
val vield is filled with only 1 byte, instead of whole value. This bug
was introduced in
commit 1fc62c526a ("Bluetooth: Fix exposing full value of shortened LTKs")

Before that patch, if you paired with device using bluetoothd using simple
pairing, and then restarted bluetoothd, you would be able to re-connect,
but device would fail to establish encryption and would terminate
connection. After this patch connecting after bluetoothd restart works
fine.

Signed-off-by: Jakub Pawlowski <jpawlowski@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-08-06 16:36:03 +02:00
Devesh Sharma d0f36c46de xprtrdma: take HCA driver refcount at client
This is a rework of the following patch sent almost a year back:
http://www.mail-archive.com/linux-rdma%40vger.kernel.org/msg20730.html

In presence of active mount if someone tries to rmmod vendor-driver, the
command remains stuck forever waiting for destruction of all rdma-cm-id.
in worst case client can crash during shutdown with active mounts.

The existing code assumes that ia->ri_id->device cannot change during
the lifetime of a transport. xprtrdma do not have support for
DEVICE_REMOVAL event either. Lifting that assumption and adding support
for DEVICE_REMOVAL event is a long chain of work, and is in plan.

The community decided that preventing the hang right now is more
important than waiting for architectural changes.

Thus, this patch introduces a temporary workaround to acquire HCA driver
module reference count during the mount of a nfs-rdma mount point.

Signed-off-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:22:56 -04:00
Chuck Lever 860477d1ff xprtrdma: Count RDMA_NOMSG type calls
RDMA_NOMSG type calls are less efficient than RDMA_MSG. Count NOMSG
calls so administrators can tell if they happen to be used more than
expected.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:28 -04:00
Chuck Lever 763f7e4e4b xprtrdma: Clean up xprt_rdma_print_stats()
checkpatch.pl complained about the seq_printf() format string split
across lines and the use of %Lu.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:28 -04:00
Chuck Lever 2fcc213a18 xprtrdma: Fix large NFS SYMLINK calls
Repair how rpcrdma_marshal_req() chooses which RDMA message type
to use for large non-WRITE operations so that it picks RDMA_NOMSG
in the correct situations, and sets up the marshaling logic to
SEND only the RPC/RDMA header.

Large NFSv2 SYMLINK requests now use RDMA_NOMSG calls. The Linux NFS
server XDR decoder for NFSv2 SYMLINK does not handle having the
pathname argument arrive in a separate buffer. The decoder could be
fixed, but this is simpler and RDMA_NOMSG can be used in a variety
of other situations.

Ensure that the Linux client continues to use "RDMA_MSG + read
list" when sending large NFSv3 SYMLINK requests, which is more
efficient than using RDMA_NOMSG.

Large NFSv4 CREATE(NF4LNK) requests are changed to use "RDMA_MSG +
read list" just like NFSv3 (see Section 5 of RFC 5667). Before,
these did not work at all.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:28 -04:00
Chuck Lever 677eb17e94 xprtrdma: Fix XDR tail buffer marshalling
Currently xprtrdma appends an extra chunk element to the RPC/RDMA
read chunk list of each NFSv4 WRITE compound. The extra element
contains the final GETATTR operation in the compound.

The result is an extra RDMA READ operation to transfer a very short
piece of each NFS WRITE compound (typically 16 bytes). This is
inefficient.

It is also incorrect.

The client is sending the trailing GETATTR at the same Position as
the preceding WRITE data payload. Whether or not RFC 5667 allows
the GETATTR to appear in a read chunk, RFC 5666 requires that these
two separate RPC arguments appear at two distinct Positions.

It can also be argued that the GETATTR operation is not bulk data,
and therefore RFC 5667 forbids its appearance in a read chunk at
all.

Although RFC 5667 is not precise about when using a read list with
NFSv4 COMPOUND is allowed, the intent is that only data arguments
not touched by NFS (ie, read and write payloads) are to be sent
using RDMA READ or WRITE.

The NFS client constructs GETATTR arguments itself, and therefore is
required to send the trailing GETATTR operation as additional inline
content, not as a data payload.

NB: This change is not backwards compatible. Some older servers do
not accept inline content following the read list. The Linux NFS
server should handle this content correctly as of commit
a97c331f9a ("svcrdma: Handle additional inline content").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:28 -04:00
Chuck Lever 33943b2974 xprtrdma: Don't provide a reply chunk when expecting a short reply
Currently Linux always offers a reply chunk, even when the reply
can be sent inline (ie. is smaller than 1KB).

On the client, registering a memory region can be expensive. A
server may choose not to use the reply chunk, wasting the cost of
the registration.

This is a change only for RPC replies smaller than 1KB which the
server constructs in the RPC reply send buffer. Because the elements
of the reply must be XDR encoded, a copy-free data transfer has no
benefit in this case.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:27 -04:00
Chuck Lever 02eb57d8f4 xprtrdma: Always provide a write list when sending NFS READ
The client has been setting up a reply chunk for NFS READs that are
smaller than the inline threshold. This is not efficient: both the
server and client CPUs have to copy the reply's data payload into
and out of the memory region that is then transferred via RDMA.

Using the write list, the data payload is moved by the device and no
extra data copying is necessary.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-By: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:27 -04:00
Chuck Lever 5457ced0b5 xprtrdma: Account for RPC/RDMA header size when deciding to inline
When the size of the RPC message is near the inline threshold (1KB),
the client would allow messages to be sent that were a few bytes too
large.

When marshaling RPC/RDMA requests, ensure the combined size of
RPC/RDMA header and RPC header do not exceed the inline threshold.
Endpoints typically reject RPC/RDMA messages that exceed the size
of their receive buffers.

The two server implementations I test with (Linux and Solaris) use
receive buffers that are larger than the client’s inline threshold.
Thus so far this has been benign, observed only by code inspection.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:27 -04:00
Chuck Lever b3221d6a53 xprtrdma: Remove logic that constructs RDMA_MSGP type calls
RDMA_MSGP type calls insert a zero pad in the middle of the RPC
message to align the RPC request's data payload to the server's
alignment preferences. A server can then "page flip" the payload
into place to avoid a data copy in certain circumstances. However:

1. The client has to have a priori knowledge of the server's
   preferred alignment

2. Requests eligible for RDMA_MSGP are requests that are small
   enough to have been sent inline, and convey a data payload
   at the _end_ of the RPC message

Today 1. is done with a sysctl, and is a global setting that is
copied during mount. Linux does not support CCP to query the
server's preferences (RFC 5666, Section 6).

A small-ish NFSv3 WRITE might use RDMA_MSGP, but no NFSv4
compound fits bullet 2.

Thus the Linux client currently leaves RDMA_MSGP disabled. The
Linux server handles RDMA_MSGP, but does not use any special
page flipping, so it confers no benefit.

Clean up the marshaling code by removing the logic that constructs
RDMA_MSGP type calls. This also reduces the maximum send iovec size
from four to just two elements.

/proc/sys/sunrpc/rdma_inline_write_padding is a kernel API, and
thus is left in place.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:27 -04:00
Chuck Lever d1ed857e57 xprtrdma: Clean up rpcrdma_ia_open()
Untangle the end of rpcrdma_ia_open() by moving DMA MR set-up, which
is different for each registration method, to the .ro_open functions.

This is refactoring only. No behavior change is expected.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:27 -04:00
Chuck Lever e531dcabec xprtrdma: Remove last ib_reg_phys_mr() call site
All HCA providers have an ib_get_dma_mr() verb. Thus
rpcrdma_ia_open() will either grab the device's local_dma_key if one
is available, or it will call ib_get_dma_mr(). If ib_get_dma_mr()
fails, rpcrdma_ia_open() fails and no transport is created.

Therefore execution never reaches the ib_reg_phys_mr() call site in
rpcrdma_register_internal(), so it can be removed.

The remaining logic in rpcrdma_{de}register_internal() is folded
into rpcrdma_{alloc,free}_regbuf().

This is clean up only. No behavior change is expected.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-By: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:26 -04:00
Chuck Lever d231093023 xprtrdma: Don't fall back to PHYSICAL memory registration
PHYSICAL memory registration uses a single rkey for all of the
client's memory, thus is insecure. It is still useful in some cases
for testing.

Retain the ability to select PHYSICAL memory registration capability
via /proc/sys/sunrpc/rdma_memreg_strategy, but don't fall back to it
if the HCA does not support FRWR or FMR.

This means amso1100 no longer works out of the box with NFS/RDMA.
When using amso1100 HCAs, set the memreg_strategy sysctl to 6 before
performing NFS/RDMA mounts.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:26 -04:00
Chuck Lever 864be126fe xprtrdma: Raise maximum payload size to one megabyte
The point of larger rsize and wsize is to reduce the per-byte cost
of memory registration and deregistration. Modern HCAs can typically
handle a megabyte or more with a single registration operation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-By: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:26 -04:00
Chuck Lever 5231eb9773 xprtrdma: Make xprt_setup_rdma() agnostic to family of server address
In particular, recognize when an IPv6 connection is bound.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-08-05 16:21:26 -04:00
Joe Stringer f58e5aa7b8 netfilter: conntrack: Use flags in nf_ct_tmpl_alloc()
The flags were ignored for this function when it was introduced. Also
fix the style problem in kzalloc.

Fixes: 0838aa7fc (netfilter: fix netns dependencies with conntrack
templates)
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-05 10:56:43 +02:00
David S. Miller 9dc20a6496 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next, they are:

1) A couple of cleanups for the netfilter core hook from Eric Biederman.

2) Net namespace hook registration, also from Eric. This adds a dependency with
   the rtnl_lock. This should be fine by now but we have to keep an eye on this
   because if we ever get the per-subsys nfnl_lock before rtnl we have may
   problems in the future. But we have room to remove this in the future by
   propagating the complexity to the clients, by registering hooks for the init
   netns functions.

3) Update nf_tables to use the new net namespace hook infrastructure, also from
   Eric.

4) Three patches to refine and to address problems from the new net namespace
   hook infrastructure.

5) Switch to alternate jumpstack in xtables iff the packet is reentering. This
   only applies to a very special case, the TEE target, but Eric Dumazet
   reports that this is slowing down things for everyone else. So let's only
   switch to the alternate jumpstack if the tee target is in used through a
   static key. This batch also comes with offline precalculation of the
   jumpstack based on the callchain depth. From Florian Westphal.

6) Minimal SCTP multihoming support for our conntrack helper, from Michal
   Kubecek.

7) Reduce nf_bridge_info per skbuff scratchpad area to 32 bytes, from Florian
   Westphal.

8) Fix several checkpatch errors in bridge netfilter, from Bernhard Thaler.

9) Get rid of useless debug message in ip6t_REJECT, from Subash Abhinov.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-04 23:57:45 -07:00
Simon Wunderlich 27a4d5efd4 batman-adv: initialize up/down values when adding a gateway
Without this initialization, gateways which actually announce up/down
bandwidth of 0/0 could be added. If these nodes get purged via
_batadv_purge_orig() later, the gw_node structure does not get removed
since batadv_gw_node_delete() updates the gw_node with up/down
bandwidth of 0/0, and the updating function then discards the change
and does not free gw_node.

This results in leaking the gw_node structures, which references other
structures: gw_node -> orig_node -> orig_node_ifinfo -> hardif. When
removing the interface later, the open reference on the hardif may cause
hangs with the infamous "unregister_netdevice: waiting for mesh1 to
become free. Usage count = 1" message.

Signed-off-by: Simon Wunderlich <simon@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-05 00:31:47 +02:00
Marek Lindner ef72706a05 batman-adv: protect tt_local_entry from concurrent delete events
The tt_local_entry deletion performed in batadv_tt_local_remove() was neither
protecting against simultaneous deletes nor checking whether the element was
still part of the list before calling hlist_del_rcu().

Replacing the hlist_del_rcu() call with batadv_hash_remove() provides adequate
protection via hash spinlocks as well as an is-element-still-in-hash check to
avoid 'blind' hash removal.

Fixes: 068ee6e204 ("batman-adv: roaming handling mechanism redesign")
Reported-by: alfonsname@web.de
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-05 00:31:47 +02:00
Marek Lindner 354136bcc3 batman-adv: fix kernel crash due to missing NULL checks
batadv_softif_vlan_get() may return NULL which has to be verified
by the caller.

Fixes: 35df3b298f ("batman-adv: fix TT VLAN inconsistency on VLAN re-add")
Reported-by: Ryan Thompson <ryan@eero.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-05 00:31:46 +02:00
Antonio Quartulli f202a666e9 batman-adv: avoid DAT to mess up LAN state
When a node running DAT receives an ARP request from the LAN for the
first time, it is likely that this node will request the ARP entry
through the distributed ARP table (DAT) in the mesh.

Once a DAT reply is received the asking node must check if the MAC
address for which the IP address has been asked is local. If it is, the
node must drop the ARP reply bceause the client should have replied on
its own locally.

Forwarding this reply means fooling any L2 bridge (e.g. Ethernet
switches) lying between the batman-adv node and the LAN. This happens
because the L2 bridge will think that the client sending the ARP reply
lies somewhere in the mesh, while this node is sitting in the same LAN.

Reported-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2015-08-05 00:31:46 +02:00
Subash Abhinov Kasiviswanathan a6cd379b4d netfilter: ip6t_REJECT: Remove debug messages from reject_tg6()
Make it similar to reject_tg() in ipt_REJECT.

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-08-04 11:12:51 +02:00
Robert Shearman a6affd24f4 mpls: Use definition for reserved label checks
In multiple locations there are checks for whether the label in hand
is a reserved label or not using the arbritray value of 16. Factor
this out into a #define for better maintainability and for
documentation.

Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-03 22:35:00 -07:00
Robert Shearman 0335f5b500 ipv4: apply lwtunnel encap for locally-generated packets
lwtunnel encap is applied for forwarded packets, but not for
locally-generated packets. This is because the output function is not
overridden in __mkroute_output, unlike it is in __mkroute_input.

The lwtunnel state is correctly set on the rth through the call to
rt_set_nexthop, so all that needs to be done is to override the dst
output function to be lwtunnel_output if there is lwtunnel state
present and it requires output redirection.

Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-03 22:26:14 -07:00
Robert Shearman abf7c1c540 lwtunnel: set skb protocol and dev
In the locally-generated packet path skb->protocol may not be set and
this is required for the lwtunnel encap in order to get the lwtstate.

This would otherwise have been set by ip_output or ip6_output so set
skb->protocol prior to calling the lwtunnel encap
function. Additionally set skb->dev in case it is needed further down
the transmit path.

Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-03 22:26:13 -07:00
Eric Dumazet 10e2eb878f udp: fix dst races with multicast early demux
Multicast dst are not cached. They carry DST_NOCACHE.

As mentioned in commit f886497212 ("ipv4: fix dst race in
sk_dst_get()"), these dst need special care before caching them
into a socket.

Caching them is allowed only if their refcnt was not 0, ie we
must use atomic_inc_not_zero()

Also, we must use READ_ONCE() to fetch sk->sk_rx_dst, as mentioned
in commit d0c294c53a ("tcp: prevent fetching dst twice in early demux
code")

Fixes: 421b3885bf ("udp: ipv4: Add udp early demux")
Tested-by: Gregory Hoggarth <Gregory.Hoggarth@alliedtelesis.co.nz>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Gregory Hoggarth <Gregory.Hoggarth@alliedtelesis.co.nz>
Reported-by: Alex Gartrell <agartrell@fb.com>
Cc: Michal Kubeček <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-03 22:16:50 -07:00
Nikolay Aleksandrov 58da018053 bridge: mdb: fix vlan_enabled access when vlans are not configured
Instead of trying to access br->vlan_enabled directly use the provided
helper br_vlan_enabled().

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-03 16:20:51 -07:00
Daniel Borkmann a5c90b29e5 act_bpf: properly support late binding of bpf action to a classifier
Since the introduction of the BPF action in d23b8ad8ab ("tc: add BPF
based action"), late binding was not working as expected. I.e. setting
the action part for a classifier only via 'bpf index <num>', where <num>
is the index of an existing action, is being rejected by the kernel due
to other missing parameters.

It doesn't make sense to require these parameters such as BPF opcodes
etc, as they are not going to be used anyway: in this case, they're just
allocated/parsed and then freed again w/o doing anything meaningful.

Instead, parse and verify the remaining parameters *after* the test on
tcf_hash_check(), when we really know that we're dealing with creation
of a new action or replacement of an existing one and where late binding
is thus irrelevant.

After patch, test case is now working:

  FOO="1,6 0 0 4294967295,"
  tc actions add action bpf bytecode "$FOO"
  tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 action bpf index 1
  tc actions show action bpf
    action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
    index 1 ref 2 bind 1
  tc filter show dev foo
    filter protocol all pref 49152 bpf
    filter protocol all pref 49152 bpf handle 0x1 flowid 1:1 bytecode '1,6 0 0 4294967295'
    action order 1: bpf bytecode '1,6 0 0 4294967295' default-action pipe
    index 1 ref 2 bind 1

Late binding of a BPF action can be useful for preloading maps (e.g. before
they hit traffic) in case of eBPF programs, or to share a single eBPF action
with multiple classifiers.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-03 16:05:56 -07:00
Satish Ashok e44deb2f0c bridge: mdb: add/del entry on all vlans if vlan_filter is enabled and vid is 0
Before this patch when a vid was not specified, the entry was added with
vid 0 which is useless when vlan_filtering is enabled. This patch makes
the entry to be added on all configured vlans when vlan filtering is
enabled and respectively deleted from all, if the entry vid is 0.
This is also closer to the way fdb works with regard to vid 0 and vlan
filtering.

Example:
Setup:
$ bridge vlan add vid 256 dev eth4
$ bridge vlan add vid 1024 dev eth4
$ bridge vlan add vid 64 dev eth3
$ bridge vlan add vid 128 dev eth3
$ bridge vlan
port	vlan ids
eth3	 1 PVID Egress Untagged
	 64
	 128

eth4	 1 PVID Egress Untagged
	 256
	 1024
$ echo 1 > /sys/class/net/br0/bridge/vlan_filtering

Before:
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1
$ bridge mdb
dev br0 port eth3 grp 239.0.0.1 temp

After:
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1
$ bridge mdb
dev br0 port eth3 grp 239.0.0.1 temp vid 1
dev br0 port eth3 grp 239.0.0.1 temp vid 128
dev br0 port eth3 grp 239.0.0.1 temp vid 64

Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-03 15:43:35 -07:00
Dan Carpenter 468b732b6f rds: fix an integer overflow test in rds_info_getsockopt()
"len" is a signed integer.  We check that len is not negative, so it
goes from zero to INT_MAX.  PAGE_SIZE is unsigned long so the comparison
is type promoted to unsigned long.  ULONG_MAX - 4095 is a higher than
INT_MAX so the condition can never be true.

I don't know if this is harmful but it seems safe to limit "len" to
INT_MAX - 4095.

Fixes: a8c879a7ee ('RDS: Info and stats')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-03 15:20:16 -07:00
Toshiaki Makita 6678053092 bridge: Don't segment multiple tagged packets on bridge device
Bridge devices don't need to segment multiple tagged packets since thier
ports can segment them.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-03 14:24:50 -07:00
WANG Cong 636dba8e12 act_mirred: avoid calling tcf_hash_release() when binding
When we share an action within a filter, the bind refcnt
should increase, therefore we should not call tcf_hash_release().

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-03 14:13:28 -07:00
Glenn Griffin 3576fd794b openvswitch: Fix L4 checksum handling when dealing with IP fragments
openvswitch modifies the L4 checksum of a packet when modifying
the ip address. When an IP packet is fragmented only the first
fragment contains an L4 header and checksum. Prior to this change
openvswitch would modify all fragments, modifying application data
in non-first fragments, causing checksum failures in the
reassembled packet.

Signed-off-by: Glenn Griffin <ggriffin.kernel@gmail.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-03 14:03:08 -07:00
Daniel Borkmann ba7591d8b2 ebpf: add skb->hash to offset map for usage in {cls, act}_bpf or filters
Add skb->hash to the __sk_buff offset map, so it can be accessed from
an eBPF program. We currently already do this for classic BPF filters,
but not yet on eBPF, it might be useful as a demuxer in combination with
helpers like bpf_clone_redirect(), toy example:

  __section("cls-lb") int ingress_main(struct __sk_buff *skb)
  {
    unsigned int which = 3 + (skb->hash & 7);
    /* bpf_skb_store_bytes(skb, ...); */
    /* bpf_l{3,4}_csum_replace(skb, ...); */
    bpf_clone_redirect(skb, which, 0);
    return -1;
  }

I was thinking whether to add skb_get_hash(), but then concluded the
raw skb->hash seems fine in this case: we can directly access the hash
w/o extra eBPF helper function call, it's filled out by many NICs on
ingress, and in case the entropy level would not be sufficient, people
can still implement their own specific sw fallback hash mix anyway.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-02 17:20:47 -07:00
Eric Dumazet 3d0e0af406 fq_codel: explicitly reset flows in ->reset()
Alex reported the following crash when using fq_codel
with htb:

  crash> bt
  PID: 630839  TASK: ffff8823c990d280  CPU: 14  COMMAND: "tc"
   [... snip ...]
   #8 [ffff8820ceec17a0] page_fault at ffffffff8160a8c2
      [exception RIP: htb_qlen_notify+24]
      RIP: ffffffffa0841718  RSP: ffff8820ceec1858  RFLAGS: 00010282
      RAX: 0000000000000000  RBX: 0000000000000000  RCX: ffff88241747b400
      RDX: ffff88241747b408  RSI: 0000000000000000  RDI: ffff8811fb27d000
      RBP: ffff8820ceec1868   R8: ffff88120cdeff24   R9: ffff88120cdeff30
      R10: 0000000000000bd4  R11: ffffffffa0840919  R12: ffffffffa0843340
      R13: 0000000000000000  R14: 0000000000000001  R15: ffff8808dae5c2e8
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
   #9 [...] qdisc_tree_decrease_qlen at ffffffff81565375
  #10 [...] fq_codel_dequeue at ffffffffa084e0a0 [sch_fq_codel]
  #11 [...] fq_codel_reset at ffffffffa084e2f8 [sch_fq_codel]
  #12 [...] qdisc_destroy at ffffffff81560d2d
  #13 [...] htb_destroy_class at ffffffffa08408f8 [sch_htb]
  #14 [...] htb_put at ffffffffa084095c [sch_htb]
  #15 [...] tc_ctl_tclass at ffffffff815645a3
  #16 [...] rtnetlink_rcv_msg at ffffffff81552cb0
  [... snip ...]

As Jamal pointed out, there is actually no need to call dequeue
to purge the queued skb's in reset, data structures can be just
reset explicitly. Therefore, we reset everything except config's
and stats, so that we would have a fresh start after device flipping.

Fixes: 4b549a2ef4 ("fq_codel: Fair Queue Codel AQM")
Reported-by: Alex Gartrell <agartrell@fb.com>
Cc: Alex Gartrell <agartrell@fb.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
[xiyou.wangcong@gmail.com: added codel_vars_init() and qdisc_qstats_backlog_dec()]
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-02 17:20:01 -07:00
David S. Miller 5510b3c2a1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	arch/s390/net/bpf_jit_comp.c
	drivers/net/ethernet/ti/netcp_ethss.c
	net/bridge/br_multicast.c
	net/ipv4/ip_fragment.c

All four conflicts were cases of simple overlapping
changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 23:52:20 -07:00
Linus Torvalds 7c764cec37 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Must teardown SR-IOV before unregistering netdev in igb driver, from
    Alex Williamson.

 2) Fix ipv6 route unreachable crash in IPVS, from Alex Gartrell.

 3) Default route selection in ipv4 should take the prefix length, table
    ID, and TOS into account, from Julian Anastasov.

 4) sch_plug must have a reset method in order to purge all buffered
    packets when the qdisc is reset, likewise for sch_choke, from WANG
    Cong.

 5) Fix deadlock and races in slave_changelink/br_setport in bridging.
    From Nikolay Aleksandrov.

 6) mlx4 bug fixes (wrong index in port even propagation to VFs,
    overzealous BUG_ON assertion, etc.) from Ido Shamay, Jack
    Morgenstein, and Or Gerlitz.

 7) Turn off klog message about SCTP userspace interface compat that
    makes no sense at all, from Daniel Borkmann.

 8) Fix unbounded restarts of inet frag eviction process, causing NMI
    watchdog soft lockup messages, from Florian Westphal.

 9) Suspend/resume fixes for r8152 from Hayes Wang.

10) Fix busy loop when MSG_WAITALL|MSG_PEEK is used in TCP recv, from
    Sabrina Dubroca.

11) Fix performance regression when removing a lot of routes from the
    ipv4 routing tables, from Alexander Duyck.

12) Fix device leak in AF_PACKET, from Lars Westerhoff.

13) AF_PACKET also has a header length comparison bug due to signedness,
    from Alexander Drozdov.

14) Fix bug in EBPF tail call generation on x86, from Daniel Borkmann.

15) Memory leaks, TSO stats, watchdog timeout and other fixes to
    thunderx driver from Sunil Goutham and Thanneeru Srinivasulu.

16) act_bpf can leak memory when replacing programs, from Daniel
    Borkmann.

17) WOL packet fixes in gianfar driver, from Claudiu Manoil.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (79 commits)
  stmmac: fix missing MODULE_LICENSE in stmmac_platform
  gianfar: Enable device wakeup when appropriate
  gianfar: Fix suspend/resume for wol magic packet
  gianfar: Fix warning when CONFIG_PM off
  act_pedit: check binding before calling tcf_hash_release()
  net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket
  net: sched: fix refcount imbalance in actions
  r8152: reset device when tx timeout
  r8152: add pre_reset and post_reset
  qlcnic: Fix corruption while copying
  act_bpf: fix memory leaks when replacing bpf programs
  net: thunderx: Fix for crash while BGX teardown
  net: thunderx: Add PCI driver shutdown routine
  net: thunderx: Fix crash when changing rss with mutliple traffic flows
  net: thunderx: Set watchdog timeout value
  net: thunderx: Wakeup TXQ only if CQE_TX are processed
  net: thunderx: Suppress alloc_pages() failure warnings
  net: thunderx: Fix TSO packet statistic
  net: thunderx: Fix memory leak when changing queue count
  net: thunderx: Fix RQ_DROP miscalculation
  ...
2015-07-31 17:10:56 -07:00
Tom Herbert be26849bfb ipv6: Disable flowlabel state ranges by default
Per RFC6437 stateful flow labels (e.g. labels set by flow label manager)
cannot "disturb" nodes taking part in stateless flow labels. While the
ranges only reduce the flow label entropy by one bit, it is conceivable
that this might bias the algorithm on some routers causing a load
imbalance. For best results on the Internet we really need the full
20 bits.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 17:07:11 -07:00
Tom Herbert 42240901f7 ipv6: Implement different admin modes for automatic flow labels
Change the meaning of net.ipv6.auto_flowlabels to provide a mode for
automatic flow labels generation. There are four modes:

0: flow labels are disabled
1: flow labels are enabled, sockets can opt-out
2: flow labels are allowed, sockets can opt-in
3: flow labels are enabled and enforced, no opt-out for sockets

np->autoflowlabel is initialized according to the sysctl value.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 17:07:11 -07:00
Tom Herbert 67800f9b1f ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel
We can't call skb_get_hash here since the packet is not complete to do
flow_dissector. Create hash based on flowi6 instead.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 17:07:11 -07:00
Tom Herbert f70ea018da net: Add functions to get skb->hash based on flow structures
Add skb_get_hash_flowi6 and skb_get_hash_flowi4 which derive an sk_buff
hash from flowi6 and flowi4 structures respectively. These functions
can be called when creating a packet in the output path where the new
sk_buff does not yet contain a fully formed packet that is parsable by
flow dissector.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 17:07:11 -07:00
Florian Fainelli 04ff53f96a net: dsa: Add netconsole support
Add support for using DSA slave network devices with netconsole, which
requires us to allocate and free custom netpoll instances and invoke the
parent network device poll controller callback.

In order for netconsole to work, we need to construct the DSA tag, but
not queue the skb for transmission on the master network device xmit
function.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 15:45:37 -07:00
Florian Fainelli 4ed70ce9f0 net: dsa: Refactor transmit path to eliminate duplication
All tagging protocols do the same thing: increment device statistics,
make room for the tag to be inserted, create the tag, invoke the parent
network device transmit function.

In order to prepare for adding netpoll support, which requires the tag
creation, but not using the parent network device transmit function, do
some little refactoring which eliminates duplication between the 4
tagging protocols supported.

We need to return a sk_buff pointer back to the caller because the tag
specific transmit function may have to reallocate the original skb (e.g:
tag_trailer.c) and this is the one we should be transmitting, not the
original sk_buff we were passed.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 15:45:37 -07:00
Joe Perches 85b1d8bbfd br2684: Remove unnecessary formatting macros b1 and bs
Use vsprintf extension %pI4 instead.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 15:25:52 -07:00
WANG Cong 5175f7106c act_pedit: check binding before calling tcf_hash_release()
When we share an action within a filter, the bind refcnt
should increase, therefore we should not call tcf_hash_release().

Fixes: 1a29321ed0 ("net_sched: act: Dont increment refcnt on replace")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 15:22:34 -07:00
Roopa Prabhu bf21563acc af_mpls: fix undefined reference to ip6_route_output
Undefined reference to ip6_route_output and ip_route_output
was reported with CONFIG_INET=n and CONFIG_IPV6=n.

This patch uses ipv6_stub_impl.ipv6_dst_lookup instead of
ip6_route_output. And wraps affected code under
IS_ENABLED(CONFIG_INET) and IS_ENABLED(CONFIG_IPV6).

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reported-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 15:21:30 -07:00
Roopa Prabhu 343d60aada ipv6: change ipv6_stub_impl.ipv6_dst_lookup to take net argument
This patch adds net argument to ipv6_stub_impl.ipv6_dst_lookup
for use cases where sk is not available (like mpls).
sk appears to be needed to get the namespace 'net' and is optional
otherwise. This patch series changes ipv6_stub_impl.ipv6_dst_lookup
to take net argument. sk remains optional.

All callers of ipv6_stub_impl.ipv6_dst_lookup have been modified
to pass net. I have modified them to use already available
'net' in the scope of the call. I can change them to
sock_net(sk) to avoid any unintended change in behaviour if sock
namespace is different. They dont seem to be from code inspection.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 15:21:30 -07:00
Alexei Starovoitov d3aa45ce6b bpf: add helpers to access tunnel metadata
Introduce helpers to let eBPF programs attached to TC manipulate tunnel metadata:
bpf_skb_[gs]et_tunnel_key(skb, key, size, flags)
skb: pointer to skb
key: pointer to 'struct bpf_tunnel_key'
size: size of 'struct bpf_tunnel_key'
flags: room for future extensions

First eBPF program that uses these helpers will allocate per_cpu
metadata_dst structures that will be used on TX.
On RX metadata_dst is allocated by tunnel driver.

Typical usage for TX:
struct bpf_tunnel_key tkey;
... populate tkey ...
bpf_skb_set_tunnel_key(skb, &tkey, sizeof(tkey), 0);
bpf_clone_redirect(skb, vxlan_dev_ifindex, 0);

RX:
struct bpf_tunnel_key tkey = {};
bpf_skb_get_tunnel_key(skb, &tkey, sizeof(tkey), 0);
... lookup or redirect based on tkey ...

'struct bpf_tunnel_key' will be extended in the future by adding
elements to the end and the 'size' argument will indicate which fields
are populated, thereby keeping backwards compatibility.
The 'flags' argument may be used as well when the 'size' is not enough or
to indicate completely different layout of bpf_tunnel_key.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 15:20:22 -07:00
Jon Paul Maloy 440d8963cd tipc: clean up link creation
We simplify the link creation function tipc_link_create() and the way
the link struct it is connected to the node struct. In particular, we
remove the duplicate initialization of some fields which are anyway set
in tipc_link_reset().

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:15 -07:00
Jon Paul Maloy 9073fb8be3 tipc: use temporary, non-protected skb queue for bundle reception
Currently, when we extract small messages from a message bundle, or
when many messages have accumulated in the link arrival queue, those
messages are added one by one to the lock protected link input queue.
This may increase contention with the reader of that queue, in
the function tipc_sk_rcv().

This commit introduces a temporary, unprotected input queue in
tipc_link_rcv() for such cases. Only when the arrival queue has been
emptied, and the function is ready to return, does it splice the whole
temporary queue into the real input queue.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:15 -07:00
Jon Paul Maloy 23d8335d78 tipc: remove implicit message delivery in node_unlock()
After the most recent changes, all access calls to a link which
may entail addition of messages to the link's input queue are
postpended by an explicit call to tipc_sk_rcv(), using a reference
to the correct queue.

This means that the potentially hazardous implicit delivery, using
tipc_node_unlock() in combination with a binary flag and a cached
queue pointer, now has become redundant.

This commit removes this implicit delivery mechanism both for regular
data messages and for binding table update messages.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:14 -07:00
Jon Paul Maloy 598411d70f tipc: make resetting of links non-atomic
In order to facilitate future improvements to the locking structure, we
want to make resetting and establishing of links non-atomic. I.e., the
functions tipc_node_link_up() and tipc_node_link_down() should be called
from outside the node lock context, and grab/release the node lock
themselves. This requires that we can freeze the link state from the
moment it is set to RESETTING or PEER_RESET in one lock context until
it is set to RESET or ESTABLISHING in a later context. The recently
introduced link FSM makes this possible, so we are now ready to introduce
the above change.

This commit implements this.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:14 -07:00
Jon Paul Maloy cf148816ac tipc: move received discovery data evaluation inside node.c
The node lock is currently grabbed and and released in the function
tipc_disc_rcv() in the file discover.c. As a preparation for the next
commits, we need to move this node lock handling, along with the code
area it is covering, to node.c.

This commit introduces this change.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:14 -07:00
Jon Paul Maloy 662921cd0a tipc: merge link->exec_mode and link->state into one FSM
Until now, we have been handling link failover and synchronization
by using an additional link state variable, "exec_mode". This variable
is not independent of the link FSM state, something causing a risk of
inconsistencies, apart from the fact that it clutters the code.

The conditions are now in place to define a new link FSM that covers
all existing use cases, including failover and synchronization, and
eliminate the "exec_mode" field altogether. The FSM must also support
non-atomic resetting of links, which will be introduced later.

The new link FSM is shown below, with 7 states and 8 events.
Only events leading to state change are shown as edges.

+------------------------------------+
|RESET_EVT                           |
|                                    |
|                             +--------------+
|           +-----------------|   SYNCHING   |-----------------+
|           |FAILURE_EVT      +--------------+   PEER_RESET_EVT|
|           |                  A            |                  |
|           |                  |            |                  |
|           |                  |            |                  |
|           |                  |SYNCH_      |SYNCH_            |
|           |                  |BEGIN_EVT   |END_EVT           |
|           |                  |            |                  |
|           V                  |            V                  V
|    +-------------+          +--------------+          +------------+
|    |  RESETTING  |<---------|  ESTABLISHED |--------->| PEER_RESET |
|    +-------------+ FAILURE_ +--------------+ PEER_    +------------+
|           |        EVT        |    A         RESET_EVT       |
|           |                   |    |                         |
|           |                   |    |                         |
|           |    +--------------+    |                         |
|  RESET_EVT|    |RESET_EVT          |ESTABLISH_EVT            |
|           |    |                   |                         |
|           |    |                   |                         |
|           V    V                   |                         |
|    +-------------+          +--------------+        RESET_EVT|
+--->|    RESET    |--------->| ESTABLISHING |<----------------+
     +-------------+ PEER_    +--------------+
      |           A  RESET_EVT       |
      |           |                  |
      |           |                  |
      |FAILOVER_  |FAILOVER_         |FAILOVER_
      |BEGIN_EVT  |END_EVT           |BEGIN_EVT
      |           |                  |
      V           |                  |
     +-------------+                 |
     | FAILINGOVER |<----------------+
     +-------------+

These changes are fully backwards compatible.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:14 -07:00
Jon Paul Maloy 5045f7b900 tipc: move protocol message sending away from link FSM
The implementation of the link FSM currently takes decisions about and
sends out link protocol messages. This is unnecessary, since such
actions are not the result of any link state change, and are even
decided based on non-FSM state information ("silent_intv_cnt").

We now move the sending of unicast link protocol messages to the
function tipc_link_timeout(), and the initial broadcast synchronization
message to tipc_node_link_up(). The latter is done because a link
instance should not need to know whether it is the first or second
link to a destination. Such information is now restricted to and
handled by the link aggregation layer in node.c

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:14 -07:00
Jon Paul Maloy 6e498158a8 tipc: move link synch and failover to link aggregation level
Link failover and synchronization have until now been handled by the
links themselves, forcing them to have knowledge about and to access
parallel links in order to make the two algorithms work correctly.

In this commit, we move the control part of this functionality to the
link aggregation level in node.c, which is the right location for this.
As a result, the two algorithms become easier to follow, and the link
implementation becomes simpler.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:14 -07:00
Jon Paul Maloy 66996b6c47 tipc: extend node FSM
In the next commit, we will move link synch/failover orchestration to
the link aggregation level. In order to do this, we first need to extend
the node FSM with two more states, NODE_SYNCHING and NODE_FAILINGOVER,
plus four new events to enter and leave those states.

This commit introduces this change, without yet making use of it.
The node FSM now looks as follows:

                           +-----------------------------------------+
                           |                            PEER_DOWN_EVT|
                           |                                         |
  +------------------------+----------------+                        |
  |SELF_DOWN_EVT           |                |                        |
  |                        |                |                        |
  |              +-----------+          +-----------+                |
  |              |NODE_      |          |NODE_      |                |
  |   +----------|FAILINGOVER|<---------|SYNCHING   |------------+   |
  |   |SELF_     +-----------+ FAILOVER_+-----------+    PEER_   |   |
  |   |DOWN_EVT   |         A  BEGIN_EVT A         |     DOWN_EVT|   |
  |   |           |         |            |         |             |   |
  |   |           |         |            |         |             |   |
  |   |           |FAILOVER_|FAILOVER_   |SYNCH_   |SYNCH_       |   |
  |   |           |END_EVT  |BEGIN_EVT   |BEGIN_EVT|END_EVT      |   |
  |   |           |         |            |         |             |   |
  |   |           |         |            |         |             |   |
  |   |           |        +--------------+        |             |   |
  |   |           +------->|   SELF_UP_   |<-------+             |   |
  |   |   +----------------|   PEER_UP    |------------------+   |   |
  |   |   |SELF_DOWN_EVT   +--------------+     PEER_DOWN_EVT|   |   |
  |   |   |                   A          A                   |   |   |
  |   |   |                   |          |                   |   |   |
  |   |   |        PEER_UP_EVT|          |SELF_UP_EVT        |   |   |
  |   |   |                   |          |                   |   |   |
  V   V   V                   |          |                   V   V   V
+------------+       +-----------+    +-----------+       +------------+
|SELF_DOWN_  |       |SELF_UP_   |    |PEER_UP_   |       |PEER_DOWN   |
|PEER_LEAVING|<------|PEER_COMING|    |SELF_COMING|------>|SELF_LEAVING|
+------------+ SELF_ +-----------+    +-----------+ PEER_ +------------+
       |       DOWN_EVT       A          A          DOWN_EVT     |
       |                      |          |                       |
       |                      |          |                       |
       |           SELF_UP_EVT|          |PEER_UP_EVT            |
       |                      |          |                       |
       |                      |          |                       |
       |PEER_DOWN_EVT       +--------------+        SELF_DOWN_EVT|
       +------------------->|  SELF_DOWN_  |<--------------------+
                            |  PEER_DOWN   |
                            +--------------+

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:13 -07:00
Jon Paul Maloy 655fb243b8 tipc: reverse call order for link_reset()->node_link_down()
In many cases the call order when a link is reset goes as follows:
tipc_node_xx()->tipc_link_reset()->tipc_node_link_down()

This is not the right order if we want the node to be in control,
so in this commit we change the order to:
tipc_node_xx()->tipc_node_link_down()->tipc_link_reset()

The fact that tipc_link_reset() now is called from only one
location with a well-defined state will also facilitate later
simplifications of tipc_link_reset() and the link FSM.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:13 -07:00
Jon Paul Maloy 6144a996a6 tipc: move all link_reset() calls to link aggregation level
In line with our effort to let the node level have full control over
its links, we want to move all link reset calls from link.c to node.c.
Some of the calls can be moved by simply moving the calling function,
when this is the right thing to do. For the remaining calls we use
the now established technique of returning a TIPC_LINK_DOWN_EVT
flag from tipc_link_rcv(), whereafter we perform the reset call when
the call returns.

This change serves as a preparation for the coming commits.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:13 -07:00
Jon Paul Maloy cbeb83ca68 tipc: eliminate function tipc_link_activate()
The function tipc_link_activate() is redundant, since it mostly performs
settings that have already been done in a preceding tipc_link_reset().

There are three exceptions to this:
- The actual state change to TIPC_LINK_WORKING. This should anyway be done
  in the FSM, and not in a separate function.
- Registration of the link with the bearer. This should be done by the
  node, since we don't want the link to have any knowledge about its
  specific bearer.
- Call to tipc_node_link_up() for user access registration. With the new
  role distribution between link aggregation and link level this becomes
  the wrong call order; tipc_node_link_up() should instead be called
  directly as a result of a TIPC_LINK_UP event, hence by the node itself.

This commit implements those changes.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:13 -07:00
David S. Miller 29a3060aa7 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2015-07-30

Here's a set of Bluetooth & 802.15.4 patches intended for the 4.3 kernel.

 - Cleanups & fixes to mac802154
 - Refactoring of Intel Bluetooth HCI driver
 - Various coding style fixes to Bluetooth HCI drivers
 - Support for Intel Lightning Peak Bluetooth devices
 - Generic class code in interface descriptor in btusb to match more HW
 - Refactoring of Bluetooth HS code together with a new config option
 - Support for BCM4330B1 Broadcom UART controller

Let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 16:16:43 -07:00
Sowmini Varadhan 8a68173691 net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket
The newsk returned by sk_clone_lock should hold a get_net()
reference if, and only if, the parent is not a kernel socket
(making this similar to sk_alloc()).

E.g,. for the SYN_RECV path, tcp_v4_syn_recv_sock->..inet_csk_clone_lock
sets up the syn_recv newsk from sk_clone_lock. When the parent (listen)
socket is a kernel socket (defined in sk_alloc() as having
sk_net_refcnt == 0), then the newsk should also have a 0 sk_net_refcnt
and should not hold a get_net() reference.

Fixes: 26abe14379 ("net: Modify sk_alloc to not reference count the
      netns of kernel sockets.")
Acked-by: Eric Dumazet <edumazet@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 15:59:12 -07:00
Hangbin Liu 8013d1d7ea net/ipv6: add sysctl option accept_ra_min_hop_limit
Commit 6fd99094de ("ipv6: Don't reduce hop limit for an interface")
disabled accept hop limit from RA if it is smaller than the current hop
limit for security stuff. But this behavior kind of break the RFC definition.

RFC 4861, 6.3.4.  Processing Received Router Advertisements
   A Router Advertisement field (e.g., Cur Hop Limit, Reachable Time,
   and Retrans Timer) may contain a value denoting that it is
   unspecified.  In such cases, the parameter should be ignored and the
   host should continue using whatever value it is already using.

   If the received Cur Hop Limit value is non-zero, the host SHOULD set
   its CurHopLimit variable to the received value.

So add sysctl option accept_ra_min_hop_limit to let user choose the minimum
hop limit value they can accept from RA. And set default to 1 to meet RFC
standards.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 15:56:40 -07:00
Daniel Borkmann 28e6b67f0b net: sched: fix refcount imbalance in actions
Since commit 55334a5db5 ("net_sched: act: refuse to remove bound action
outside"), we end up with a wrong reference count for a tc action.

Test case 1:

  FOO="1,6 0 0 4294967295,"
  BAR="1,6 0 0 4294967294,"
  tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 \
     action bpf bytecode "$FOO"
  tc actions show action bpf
    action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
    index 1 ref 1 bind 1
  tc actions replace action bpf bytecode "$BAR" index 1
  tc actions show action bpf
    action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe
    index 1 ref 2 bind 1
  tc actions replace action bpf bytecode "$FOO" index 1
  tc actions show action bpf
    action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
    index 1 ref 3 bind 1

Test case 2:

  FOO="1,6 0 0 4294967295,"
  tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 action ok
  tc actions show action gact
    action order 0: gact action pass
    random type none pass val 0
     index 1 ref 1 bind 1
  tc actions add action drop index 1
    RTNETLINK answers: File exists [...]
  tc actions show action gact
    action order 0: gact action pass
     random type none pass val 0
     index 1 ref 2 bind 1
  tc actions add action drop index 1
    RTNETLINK answers: File exists [...]
  tc actions show action gact
    action order 0: gact action pass
     random type none pass val 0
     index 1 ref 3 bind 1

What happens is that in tcf_hash_check(), we check tcf_common for a given
index and increase tcfc_refcnt and conditionally tcfc_bindcnt when we've
found an existing action. Now there are the following cases:

  1) We do a late binding of an action. In that case, we leave the
     tcfc_refcnt/tcfc_bindcnt increased and are done with the ->init()
     handler. This is correctly handeled.

  2) We replace the given action, or we try to add one without replacing
     and find out that the action at a specific index already exists
     (thus, we go out with error in that case).

In case of 2), we have to undo the reference count increase from
tcf_hash_check() in the tcf_hash_check() function. Currently, we fail to
do so because of the 'tcfc_bindcnt > 0' check which bails out early with
an -EPERM error.

Now, while commit 55334a5db5 prevents 'tc actions del action ...' on an
already classifier-bound action to drop the reference count (which could
then become negative, wrap around etc), this restriction only accounts for
invocations outside a specific action's ->init() handler.

One possible solution would be to add a flag thus we possibly trigger
the -EPERM ony in situations where it is indeed relevant.

After the patch, above test cases have correct reference count again.

Fixes: 55334a5db5 ("net_sched: act: refuse to remove bound action outside")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 14:20:39 -07:00
Alexander Aring 5857d1dbae Bluetooth: 6lowpan: Fix possible race
This patch fix a possible race after calling register_netdev. After
calling netdev_register it could be possible that netdev_ops callbacks
use the uninitialized private data of lowpan_dev. By moving the
initialization of this data before netdev_register we can be sure that
initialized private data is be used after netdev_register.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-30 14:11:36 +02:00
Lennert Buytenhek c22ff7b4e7 mac802154: Fix memory corruption with global deferred transmit state.
When transmitting a packet via a mac802154 driver that can sleep in
its transmit function, mac802154 defers the call to the driver's
transmit function to a per-device workqueue.

However, mac802154 uses a single global work_struct for this, which
means that if you have more than one registered mac802154 interface
in the system, and you transmit on more than one of them at the same
time, you'll very easily cause memory corruption.

This patch moves the deferred transmit processing state from global
variables to struct ieee802154_local, and this seems to fix the memory
corruption issue.

Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org>
Acked-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-30 14:08:55 +02:00
Dan Carpenter 1a727c6361 netfilter: nf_conntrack: checking for IS_ERR() instead of NULL
We recently changed this from nf_conntrack_alloc() to nf_ct_tmpl_alloc()
so the error handling needs to changed to check for NULL instead of
IS_ERR().

Fixes: 0838aa7fcf ('netfilter: fix netns dependencies with conntrack templates')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-07-30 14:04:19 +02:00
Christophe JAILLET 54c9ee3992 Bluetooth: cmtp: Do not use list_for_each_safe when not needed
There is no need to use the safe version of list_for_each here.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-30 13:50:35 +02:00
Bernhard Thaler f4b3eee727 netfilter: bridge: do not initialize statics to 0 or NULL
Fix checkpatch.pl "ERROR: do not initialise statics to 0 or NULL" for
all statics explicitly initialized to 0.

Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-07-30 13:46:04 +02:00
Florian Westphal 72b1e5e4ca netfilter: bridge: reduce nf_bridge_info to 32 bytes again
We can use union for most of the temporary cruft (original ipv4/ipv6
address, source mac, physoutdev) since they're used during different
stages of br netfilter traversal.

Also get rid of the last two ->mask users.

Shrinks struct from 48 to 32 on 64bit arch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-07-30 13:37:42 +02:00
Arron Wang df9b89c7e4 Bluetooth: Move create/accept phy link completed callback to amp.c
To avoid amp module hooks from hci_event.c

Signed-off-by: Arron Wang <arron.wang@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-30 13:37:22 +02:00
Arron Wang b3d3914006 Bluetooth: Move amp assoc read/write completed callback to amp.c
To avoid amp module hooks from hci_event.c

Signed-off-by: Arron Wang <arron.wang@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-30 13:37:22 +02:00
Arron Wang 839278823c Bluetooth: Move get info completed callback to a2mp.c
To avoid a2mp module hooks from hci_event.c and send
getinfo response operation only required by a2mp module,
we can move this callback to a2mp.c

Signed-off-by: Arron Wang <arron.wang@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-30 13:37:22 +02:00
Arron Wang a77a6a14e5 Bluetooth: Move high speed specific event under BT_HS option
Signed-off-by: Arron Wang <arron.wang@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-30 13:31:59 +02:00
Arron Wang 244bc37759 Bluetooth: Add BT_HS config option
Move A2MP Module under BT_HS config option and allow
the user have flexible option to choose the feature only
they need

a2mp_discover_amp() & a2mp_channel_create() are a2mp module
entry point for master and slave, and this is dynamic
invoked depends on the userspace or remote request, then
we defined their implementation depends on BT_HS config

Signed-off-by: Arron Wang <arron.wang@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-30 13:31:59 +02:00
Michal Kubeček d7ee351904 netfilter: nf_ct_sctp: minimal multihoming support
Currently nf_conntrack_proto_sctp module handles only packets between
primary addresses used to establish the connection. Any packets between
secondary addresses are classified as invalid so that usual firewall
configurations drop them. Allowing HEARTBEAT and HEARTBEAT-ACK chunks to
establish a new conntrack would allow traffic between secondary
addresses to pass through. A more sophisticated solution based on the
addresses advertised in the initial handshake (and possibly also later
dynamic address addition and removal) would be much harder to implement.
Moreover, in general we cannot assume to always see the initial
handshake as it can be routed through a different path.

The patch adds two new conntrack states:

  SCTP_CONNTRACK_HEARTBEAT_SENT  - a HEARTBEAT chunk seen but not acked
  SCTP_CONNTRACK_HEARTBEAT_ACKED - a HEARTBEAT acked by HEARTBEAT-ACK

State transition rules:

- HEARTBEAT_SENT responds to usual chunks the same way as NONE (so that
  the behaviour changes as little as possible)
- HEARTBEAT_ACKED responds to usual chunks the same way as ESTABLISHED
  does, except the resulting state is HEARTBEAT_ACKED rather than
  ESTABLISHED
- previously existing states except NONE are preserved when HEARTBEAT or
  HEARTBEAT-ACK is seen
- NONE (in the initial direction) changes to HEARTBEAT_SENT on HEARTBEAT
  and to CLOSED on HEARTBEAT-ACK
- HEARTBEAT_SENT changes to HEARTBEAT_ACKED on HEARTBEAT-ACK in the
  reply direction
- HEARTBEAT_SENT and HEARTBEAT_ACKED are preserved on HEARTBEAT and
  HEARTBEAT-ACK otherwise

Normally, vtag is set from the INIT chunk for the reply direction and
from the INIT-ACK chunk for the originating direction (i.e. each of
these defines vtag value for the opposite direction). For secondary
conntracks, we can't rely on seeing INIT/INIT-ACK and even if we have
seen them, we would need to connect two different conntracks. Therefore
simplified logic is applied: vtag of first packet in each direction
(HEARTBEAT in the originating and HEARTBEAT-ACK in reply direction) is
saved and all following packets in that direction are compared with this
saved value. While INIT and INIT-ACK define vtag for the opposite
direction, vtags extracted from HEARTBEAT and HEARTBEAT-ACK are always
for their direction.

Default timeout values for new states are

  HEARTBEAT_SENT: 30 seconds (default hb_interval)
  HEARTBEAT_ACKED: 210 seconds (hb_interval * path_max_retry + max_rto)

(We cannot expect to see the shutdown sequence so that, unlike
ESTABLISHED, the HEARTBEAT_ACKED timeout shouldn't be too long.)

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-07-30 12:59:25 +02:00
Pablo Neira Ayuso f0ad462189 netfilter: nf_conntrack: silence warning on falling back to vmalloc()
Since 88eab472ec ("netfilter: conntrack: adjust nf_conntrack_buckets default
value"), the hashtable can easily hit this warning. We got reports from users
that are getting this message in a quite spamming fashion, so better silence
this.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Florian Westphal <fw@strlen.de>
2015-07-30 12:32:55 +02:00
Daniel Borkmann f4eaed28c7 act_bpf: fix memory leaks when replacing bpf programs
We currently trigger multiple memory leaks when replacing bpf
actions, besides others:

  comm "tc", pid 1909, jiffies 4294851310 (age 1602.796s)
  hex dump (first 32 bytes):
    01 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00  ................
    18 b0 98 6d 00 88 ff ff 00 00 00 00 00 00 00 00  ...m............
  backtrace:
    [<ffffffff817e623e>] kmemleak_alloc+0x4e/0xb0
    [<ffffffff8120a22d>] __vmalloc_node_range+0x1bd/0x2c0
    [<ffffffff8120a37a>] __vmalloc+0x4a/0x50
    [<ffffffff811a8d0a>] bpf_prog_alloc+0x3a/0xa0
    [<ffffffff816c0684>] bpf_prog_create+0x44/0xa0
    [<ffffffffa09ba4eb>] tcf_bpf_init+0x28b/0x3c0 [act_bpf]
    [<ffffffff816d7001>] tcf_action_init_1+0x191/0x1b0
    [<ffffffff816d70a2>] tcf_action_init+0x82/0xf0
    [<ffffffff816d4d12>] tcf_exts_validate+0xb2/0xc0
    [<ffffffffa09b5838>] cls_bpf_modify_existing+0x98/0x340 [cls_bpf]
    [<ffffffffa09b5cd6>] cls_bpf_change+0x1a6/0x274 [cls_bpf]
    [<ffffffff816d56e5>] tc_ctl_tfilter+0x335/0x910
    [<ffffffff816b9145>] rtnetlink_rcv_msg+0x95/0x240
    [<ffffffff816df34f>] netlink_rcv_skb+0xaf/0xc0
    [<ffffffff816b909e>] rtnetlink_rcv+0x2e/0x40
    [<ffffffff816deaaf>] netlink_unicast+0xef/0x1b0

Issue is that the old content from tcf_bpf is allocated and needs
to be released when we replace it. We seem to do that since the
beginning of act_bpf on the filter and insns, later on the name as
well.

Example test case, after patch:

  # FOO="1,6 0 0 4294967295,"
  # BAR="1,6 0 0 4294967294,"
  # tc actions add action bpf bytecode "$FOO" index 2
  # tc actions show action bpf
   action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
   index 2 ref 1 bind 0
  # tc actions replace action bpf bytecode "$BAR" index 2
  # tc actions show action bpf
   action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe
   index 2 ref 1 bind 0
  # tc actions replace action bpf bytecode "$FOO" index 2
  # tc actions show action bpf
   action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
   index 2 ref 1 bind 0
  # tc actions del action bpf index 2
  [...]
  # echo "scan" > /sys/kernel/debug/kmemleak
  # cat /sys/kernel/debug/kmemleak | grep "comm \"tc\"" | wc -l
  0

Fixes: d23b8ad8ab ("tc: add BPF based action")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 23:56:22 -07:00
Thomas Graf dcc38c033b openvswitch: Re-add CONFIG_OPENVSWITCH_VXLAN
This readds the config option CONFIG_OPENVSWITCH_VXLAN to avoid a
hard dependency of OVS on VXLAN. It moves the VXLAN config compat
code to vport-vxlan.c and allows compliation as a module.

Fixes: 614732eaa1 ("openvswitch: Use regular VXLAN net_device device")
Fixes: 2661371ace ("openvswitch: fix compilation when vxlan is a module")
Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 23:03:10 -07:00
Eric Dumazet c8507fb235 ipv6: flush nd cache on IFF_NOARP change
This patch is the IPv6 equivalent of commit
6c8b4e3ff8 ("arp: flush arp cache on IFF_NOARP change")

Without it, we keep buggy neighbours in the cache, with destination
MAC address equal to our own MAC address.

Tested:
 tcpdump -i eth0 -s 0 ip6 -n -e &
 ip link set dev eth0 arp off
 ping6 remote   // sends buggy frames
 ip link set dev eth0 arp on
 ping6 remote   // should work once kernel is patched

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Mario Fanelli <mariofanelli@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 23:01:39 -07:00
Bogdan Hamciuc 1ce4b2f44d net: pktgen: Remove unused 'allocated_skbs' field
Field pktgen_dev.allocated_skbs had been written to, but never read
from. The number of allocated skbs can be deduced anyway, from the total
number of sent packets and the 'clone_skb' param.

Signed-off-by: Bogdan Hamciuc <bogdan.hamciuc@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 23:00:39 -07:00
Bogdan Hamciuc 879c7220e8 net: pktgen: Observe needed_headroom of the device
Allocate enough space so as not to force the outgoing net device to do
skb_realloc_headroom().

Signed-off-by: Bogdan Hamciuc <bogdan.hamciuc@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 23:00:39 -07:00
Thomas Graf 92a99bf3ba lwtunnel: Make lwtun_encaps[] static
Any external user should use the registration API instead of
accessing this directly.

Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 22:59:39 -07:00
Tom Herbert 877d1f6291 net: Set sk_txhash from a random number
This patch creates sk_set_txhash and eliminates protocol specific
inet_set_txhash and ip6_set_txhash. sk_set_txhash simply sets a
random number instead of performing flow dissection. sk_set_txash
is also allowed to be called multiple times for the same socket,
we'll need this when redoing the hash for negative routing advice.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 22:44:04 -07:00
Jon Maloy 5a4c355229 tipc: fix bug in broadcast synch message create function
In commit d999297c3d
("tipc: reduce locking scope during packet reception") we introduced
a new function tipc_build_bcast_sync_msg(), which carries initial
synchronization data between two nodes at first contact and at
re-contact. In this function, we missed to add synchronization data,
with the effect that the broadcast link endpoints will fail to
synchronize correctly at re-contact between a running and a restarted
node. All other cases work as intended.

With this commit, we fix this bug.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 16:48:16 -07:00
Tobias Klauser 73d0fcf2f4 packet: remove handling of tx_ring from prb_shutdown_retire_blk_timer()
Follow e8e85cc5eb ("packet: remove handling of tx_ring") and remove
the tx_ring parameter from prb_shutdown_retire_blk_timer() as it is only
called with tx_ring = 0.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 15:11:12 -07:00
Nikolay Aleksandrov 7ae90a4f96 bridge: mdb: fix delmdb state in the notification
Since mdb states were introduced when deleting an entry the state was
left as it was set in the delete request from the user which leads to
the following output when doing a monitor (for example):
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent
$ bridge mdb del dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 temp
^^^
Note the "temp" state in the delete notification which is wrong since
the entry was permanent, the state in a delete is always reported as
"temp" regardless of the real state of the entry.

After this patch:
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent
$ bridge mdb del dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent

There's one important note to make here that the state is actually not
matched when doing a delete, so one can delete a permanent entry by
stating "temp" in the end of the command, I've chosen this fix in order
not to break user-space tools which rely on this (incorrect) behaviour.

So to give an example after this patch and using the wrong state:
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent
$ bridge mdb del dev br0 port eth3 grp 239.0.0.1 temp
(monitor) dev br0 port eth3 grp 239.0.0.1 permanent

Note the state of the entry that got deleted is correct in the
notification.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: ccb1c31a7a ("bridge: add flags to distinguish permanent mdb entires")
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 15:02:30 -07:00
Satish Ashok 544586f742 bridge: mcast: give fast leave precedence over multicast router and querier
When fast leave is configured on a bridge port and an IGMP leave is
received for a group, the group is not deleted immediately if there is
a router detected or if multicast querier is configured.
Ideally the group should be deleted immediately when fast leave is
configured.

Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 14:57:05 -07:00
Toshiaki Makita df356d5e81 bridge: Fix network header pointer for vlan tagged packets
There are several devices that can receive vlan tagged packets with
CHECKSUM_PARTIAL like tap, possibly veth and xennet.
When (multiple) vlan tagged packets with CHECKSUM_PARTIAL are forwarded
by bridge to a device with the IP_CSUM feature, they end up with checksum
error because before entering bridge, the network header is set to
ETH_HLEN (not including vlan header length) in __netif_receive_skb_core(),
get_rps_cpu(), or drivers' rx functions, and nobody fixes the pointer later.

Since the network header is exepected to be ETH_HLEN in flow-dissection
and hash-calculation in RPS in rx path, and since the header pointer fix
is needed only in tx path, set the appropriate network header on forwarding
packets.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 12:20:16 -07:00
Alexander Drozdov dbd46ab412 packet: tpacket_snd(): fix signed/unsigned comparison
tpacket_fill_skb() can return a negative value (-errno) which
is stored in tp_len variable. In that case the following
condition will be (but shouldn't be) true:

tp_len > dev->mtu + dev->hard_header_len

as dev->mtu and dev->hard_header_len are both unsigned.

That may lead to just returning an incorrect EMSGSIZE errno
to the user.

Fixes: 52f1454f62 ("packet: allow to transmit +4 byte in TX_RING slot for VLAN case")
Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 00:09:58 -07:00
Eric Dumazet 11c91ef98f arp: filter NOARP neighbours for SIOCGARP
When arp is off on a device, and ioctl(SIOCGARP) is queried,
a buggy answer is given with MAC address of the device, instead
of the mac address of the destination/gateway.

We filter out NUD_NOARP neighbours for /proc/net/arp,
we must do the same for SIOCGARP ioctl.

Tested:

lpaa23:~# ./arp 10.246.7.190
MAC=00:01:e8:22:cb:1d      // correct answer

lpaa23:~# ip link set dev eth0 arp off
lpaa23:~# cat /proc/net/arp   # check arp table is now 'empty'
IP address       HW type     Flags       HW address    Mask     Device
lpaa23:~# ./arp 10.246.7.190
MAC=00:1a:11:c3:0d:7f   // buggy answer before patch (this is eth0 mac)

After patch :

lpaa23:~# ip link set dev eth0 arp off
lpaa23:~# ./arp 10.246.7.190
ioctl(SIOCGARP) failed: No such device or address

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Vytautas Valancius <valas@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-28 23:41:24 -07:00
David Ward 865b804244 net/ipv4: suppress NETDEV_UP notification on address lifetime update
This notification causes the FIB to be updated, which is not needed
because the address already exists, and more importantly it may undo
intentional changes that were made to the FIB after the address was
originally added. (As a point of comparison, when an address becomes
deprecated because its preferred lifetime expired, a notification on
this chain is not generated.)

The motivation for this commit is fixing an incompatibility between
DHCP clients which set and update the address lifetime according to
the lease, and a commercial VPN client which replaces kernel routes
in a way that outbound traffic is sent only through the tunnel (and
disconnects if any further route changes are detected via netlink).

Signed-off-by: David Ward <david.ward@ll.mit.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-28 23:38:13 -07:00
Nikolay Aleksandrov 76b91c32dd bridge: stp: when using userspace stp stop kernel hello and hold timers
These should be handled only by the respective STP which is in control.
They become problematic for devices with limited resources with many
ports because the hold_timer is per port and fires each second and the
hello timer fires each 2 seconds even though it's global. While in
user-space STP mode these timers are completely unnecessary so it's better
to keep them off.
Also ensure that when the bridge is up these timers are started only when
running with kernel STP.

Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-28 23:33:20 -07:00
Linus Torvalds d8132e08d2 NFS client bugfixes for Linux 4.2
Highlights include:
 
 Stable patches:
 - Fix a situation where the client uses the wrong (zero) stateid.
 - Fix a memory leak in nfs_do_recoalesce
 
 Bugfixes:
 - Plug a memory leak when ->prepare_layoutcommit fails
 - Fix an Oops in the NFSv4 open code
 - Fix a backchannel deadlock
 - Fix a livelock in sunrpc when sendmsg fails due to low memory availability
 - Don't revalidate the mapping if both size and change attr are up to date
 - Ensure we don't miss a file extension when doing pNFS
 - Several fixes to handle NFSv4.1 sequence operation status bits correctly
 - Several pNFS layout return bugfixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVt6RGAAoJEGcL54qWCgDyiDIP/2+fUM7Tc1llCxYbM2WLC6Ar
 34v5yVwO96MqhI4L2mXB5FJvr4LP2/EZ4ZExMcf4ymT7pgJnjFK4nEv9IHUSy6xb
 ea+oS9GjvFSeGdkukJLRniNER5/ZG3GWkojlHNJCgByoIVRK4ISXF/qL9w2sedGw
 +5ejvjqie9NmBnBXMq8DRlU+kXhVYCF6E9qWATwUNK5Eq2eeQnDbA2w9ACSBVK3W
 LhCvZi0eBq7krSbHob018PmlQ0VPvmYwk5xL4d//FvcaNj/utk82VjAZCdKOK1sH
 qn8hcKgVeVko/3jwcUp6m3zAkKZ1IX/XaXJeHbosnKG/g0vy3hQirpa/g2iDTQ4H
 NXOSwcsd6syReZDZbQTxbvaSOp5ACxZAQKYLnlPerJ/hMpXDQCEAwyeAFKzEaKz4
 FfF0VJF+30w9PJk3wgk2DF66xbYVfHyvrLtVcb/ki8gb91cH09i+nFFSSfHQBMLh
 +ciHg7rOyXnbXoCaW9fBvONz2sCYDwbHATmhpWWZIx/3UTDf5owxHFa3BFDgGKnD
 jyiPjMh6I3JUE+Qm1zwInsfsskBKRSl2BdJgTHBGY5ODuQGF/sogOmvgbrT7Ox3t
 kbL8nzCydqLixM+4aw61nYakZqgDsKNER5Ggr+lkv4AZ2dH6IeP2IZjuoHLLylvZ
 dyqHwpCjoUtmYAUr166U
 =wlUD
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Stable patches:
   - Fix a situation where the client uses the wrong (zero) stateid.
   - Fix a memory leak in nfs_do_recoalesce

  Bugfixes:
   - Plug a memory leak when ->prepare_layoutcommit fails
   - Fix an Oops in the NFSv4 open code
   - Fix a backchannel deadlock
   - Fix a livelock in sunrpc when sendmsg fails due to low memory
     availability
   - Don't revalidate the mapping if both size and change attr are up to
     date
   - Ensure we don't miss a file extension when doing pNFS
   - Several fixes to handle NFSv4.1 sequence operation status bits
     correctly
   - Several pNFS layout return bugfixes"

* tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (28 commits)
  nfs: Fix an oops caused by using other thread's stack space in ASYNC mode
  nfs: plug memory leak when ->prepare_layoutcommit fails
  SUNRPC: Report TCP errors to the caller
  sunrpc: translate -EAGAIN to -ENOBUFS when socket is writable.
  NFSv4.2: handle NFS-specific llseek errors
  NFS: Don't clear desc->pg_moreio in nfs_do_recoalesce()
  NFS: Fix a memory leak in nfs_do_recoalesce
  NFS: nfs_mark_for_revalidate should always set NFS_INO_REVAL_PAGECACHE
  NFS: Remove the "NFS_CAP_CHANGE_ATTR" capability
  NFS: Set NFS_INO_REVAL_PAGECACHE if the change attribute is uninitialised
  NFS: Don't revalidate the mapping if both size and change attr are up to date
  NFSv4/pnfs: Ensure we don't miss a file extension
  NFSv4: We must set NFS_OPEN_STATE flag in nfs_resync_open_stateid_locked
  SUNRPC: xprt_complete_bc_request must also decrement the free slot count
  SUNRPC: Fix a backchannel deadlock
  pNFS: Don't throw out valid layout segments
  pNFS: pnfs_roc_drain() fix a race with open
  pNFS: Fix races between return-on-close and layoutreturn.
  pNFS: pnfs_roc_drain should return 'true' when sleeping
  pNFS: Layoutreturn must invalidate all existing layout segments.
  ...
2015-07-28 09:37:44 -07:00
Lars Westerhoff 158cd4af8d packet: missing dev_put() in packet_do_bind()
When binding a PF_PACKET socket, the use count of the bound interface is
always increased with dev_hold in dev_get_by_{index,name}.  However,
when rebound with the same protocol and device as in the previous bind
the use count of the interface was not decreased.  Ultimately, this
caused the deletion of the interface to fail with the following message:

unregister_netdevice: waiting for dummy0 to become free. Usage count = 1

This patch moves the dev_put out of the conditional part that was only
executed when either the protocol or device changed on a bind.

Fixes: 902fefb82e ('packet: improve socket create/bind latency in some cases')
Signed-off-by: Lars Westerhoff <lars.westerhoff@newtec.eu>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27 15:38:58 -07:00
Trond Myklebust f580dd0428 SUNRPC: Report TCP errors to the caller
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-27 17:56:57 -04:00
Alexander Duyck 1513069edc fib_trie: Drop unnecessary calls to leaf_pull_suffix
It was reported that update_suffix was taking a long time on systems where
a large number of leaves were attached to a single node.  As it turns out
fib_table_flush was calling update_suffix for each leaf that didn't have all
of the aliases stripped from it.  As a result, on this large node removing
one leaf would result in us calling update_suffix for every other leaf on
the node.

The fix is to just remove the calls to leaf_pull_suffix since they are
redundant as we already have a call in resize that will go through and
update the suffix length for the node before we exit out of
fib_table_flush or fib_table_flush_external.

Reported-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27 14:29:11 -07:00
NeilBrown 743c69e7c0 sunrpc: translate -EAGAIN to -ENOBUFS when socket is writable.
The networking layer does not reliably report the distinction between
a non-block write failing because:
 1/ the queue is too full already and
 2/ a memory allocation attempt failed.

The distinction is important because in the first case it is
appropriate to retry as soon as the socket reports that it is
writable, and in the second case a small delay is required as the
socket will most likely report as writable but kmalloc could still
fail.

sk_stream_wait_memory() exhibits this distinction nicely, setting
'vm_wait' if a small wait is needed.  However in the non-blocking case
it always returns -EAGAIN no matter the cause of the failure.  This
-EAGAIN call get all the way to sunrpc.

The sunrpc layer expects EAGAIN to indicate the first cause, and
ENOBUFS to indicate the second.  Various documentation suggests that
this is not unreasonable, but does not guarantee the desired error
codes.

The result of getting -EAGAIN when -ENOBUFS is expected is that the
send is tried again in a tight loop and soft lockups are reported.

so: add tests after calls to xs_sendpages() to translate -EAGAIN into
-ENOBUFS if the socket is writable.  This cannot happen inside
xs_sendpages() as the test for "is socket writable" is different
between TCP and UDP.

With this change, the tight loop retrying xs_sendpages() becomes a
loop which only retries every 250ms, and so will not trigger a
soft-lockup warning.

It is possible that the write did fail because the queue was too full
and by the time xs_sendpages() completed, the queue was writable
again.  In this case an extra 250ms delay is inserted that isn't
really needed.  This circumstance suggests a degree of congestion so a
delay is not necessarily a bad thing, and it can only cause a single
250ms delay, not a series of them.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-27 11:16:56 -04:00
Dan Carpenter e11f40b935 lwtunnel: use kfree_skb() instead of vanilla kfree()
kfree_skb() is correct here.

Fixes: ffce41962e ('lwtunnel: support dst output redirect function')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27 01:24:59 -07:00
Eric Dumazet 99d7662a04 tcp: tso: allow deferring under reordering state
While doing experiments with reordering resilience, we found
linux senders were not able to send at full speed under reordering,
because every incoming SACK was releasing one MSS.

This patch removes the limitation, as we did for CWR state
in commit a0ea700e40 ("tcp: tso: allow CA_CWR state in
tcp_tso_should_defer()")

Neal Cardwell had a concern about limited transmit so
Yuchung conducted experiments on GFE and found nothing
worth adding an extra check on fast path :

  if (icsk->icsk_ca_state == TCP_CA_Disorder &&
      tcp_sk(sk)->reordering == sysctl_tcp_reordering)
          goto send_now;

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27 01:23:20 -07:00
Martin KaFai Lau 8d6c31bf57 ipv6: Avoid rt6_probe() taking writer lock in the fast path
The patch checks neigh->nud_state before acquiring the writer lock.
Note that rt6_probe() is only used in CONFIG_IPV6_ROUTER_PREF.

40 udpflood processes and a /64 gateway route are used.
The gateway has NUD_PERMANENT.  Each of them is run for 30s.
At the end, the total number of finished sendto():

Before: 55M
After: 95M

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
CC: Julian Anastasov <ja@ssi.bg>
CC: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27 01:08:25 -07:00
Martin KaFai Lau 990edb428c ipv6: Re-arrange code in rt6_probe()
It is a prep work for the next patch to remove write_lock
from rt6_probe().

1. Reduce the number of if(neigh) check.  From 4 to 1.
2. Bring the write_(un)lock() closer to the operations that the
   lock is protecting.

Hopefully, the above make rt6_probe() more readable.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27 01:08:25 -07:00
Sabrina Dubroca dfbafc9953 tcp: fix recv with flags MSG_WAITALL | MSG_PEEK
Currently, tcp_recvmsg enters a busy loop in sk_wait_data if called
with flags = MSG_WAITALL | MSG_PEEK.

sk_wait_data waits for sk_receive_queue not empty, but in this case,
the receive queue is not empty, but does not contain any skb that we
can use.

Add a "last skb seen on receive queue" argument to sk_wait_data, so
that it sleeps until the receive queue has new skbs.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=99461
Link: https://sourceware.org/bugzilla/show_bug.cgi?id=18493
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1205258
Reported-by: Enrico Scholz <rh-bugzilla@ensc.de>
Reported-by: Dan Searle <dan@censornet.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27 01:06:53 -07:00
Nicolas Dichtel 5a6228a0b4 lwtunnel: change prototype of lwtunnel_state_get()
It saves some lines and simplify a bit the code when the state is returning
by this function. It's also useful to handle a NULL entry.

To avoid too long lines, I've also renamed lwtunnel_state_get() and
lwtunnel_state_put() to lwtstate_get() and lwtstate_put().

CC: Thomas Graf <tgraf@suug.ch>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27 01:02:49 -07:00
Nicolas Dichtel d943659508 ipv6: copy lwtstate in ip6_rt_copy_init()
We need to copy this field (ip6_rt_cache_alloc() and ip6_rt_pcpu_alloc()
use ip6_rt_copy_init() to build a dst).

CC: Thomas Graf <tgraf@suug.ch>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
Fixes: 19e42e4515 ("ipv6: support for fib route lwtunnel encap attributes")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27 01:02:49 -07:00
Nicolas Dichtel 6673a9f4e3 ipv6: use lwtunnel_output6() only if flag redirect is set
This function make sense only when LWTUNNEL_STATE_OUTPUT_REDIRECT is set.
The check is already done in IPv4.

CC: Thomas Graf <tgraf@suug.ch>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
Fixes: 74a0f2fe8e ("ipv6: rt6_info output redirect to tunnel output")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27 01:00:51 -07:00
subashab@codeaurora.org b469139e81 dev: Spelling fix in comments
Fix the following typo
- unchainged -> unchanged

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27 00:54:57 -07:00
David S. Miller 03de104f7b Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Johan Hedberg says:

====================
pull request: bluetooth 2015-07-23

Here's another one-patch pull request for 4.2 which targets a potential
NULL pointer dereference in the LE Security Manager code that can be
triggered by using older user space tools. The issue has been there
since 4.0 so there's the appropriate "Cc: stable" in place.

Let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26 21:53:08 -07:00
Satish Ashok 949f1e39a6 bridge: mdb: notify on router port add and del
Send notifications on router port add and del/expire, re-use the already
existing MDBA_ROUTER and send NEWMDB/DELMDB netlink notifications
respectively.

Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26 21:19:58 -07:00
Thomas Graf 597798e438 openvswitch: Retrieve tunnel metadata when receiving from vport-netdev
Retrieve the tunnel metadata for packets received by a net_device and
provide it to ovs_vport_receive() for flow key extraction.

[This hunk was in the GRE patch in the initial series and missed the
 cut for the initial submission for merging.]

Fixes: 614732eaa1 ("openvswitch: Use regular VXLAN net_device device")
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26 21:19:11 -07:00
Nikolay Aleksandrov caaecdd3d3 inet: frags: remove INET_FRAG_EVICTED and use list_evictor for the test
We can simply remove the INET_FRAG_EVICTED flag to avoid all the flags
race conditions with the evictor and use a participation test for the
evictor list, when we're at that point (after inet_frag_kill) in the
timer there're 2 possible cases:

1. The evictor added the entry to its evictor list while the timer was
waiting for the chainlock
or
2. The timer unchained the entry and the evictor won't see it

In both cases we should be able to see list_evictor correctly due
to the sync on the chainlock.

Joint work with Florian Westphal.

Tested-by: Frank Schreuder <fschreuder@transip.nl>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26 21:00:15 -07:00
Florian Westphal 5719b296fb inet: frag: don't wait for timer deletion when evicting
Frank reports 'NMI watchdog: BUG: soft lockup' errors when
load is high.  Instead of (potentially) unbounded restarts of the
eviction process, just skip to the next entry.

One caveat is that, when a netns is exiting, a timer may still be running
by the time inet_evict_bucket returns.

We use the frag memory accounting to wait for outstanding timers,
so that when we free the percpu counter we can be sure no running
timer will trip over it.

Reported-and-tested-by: Frank Schreuder <fschreuder@transip.nl>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26 21:00:14 -07:00
Florian Westphal 0e60d245a0 inet: frag: change *_frag_mem_limit functions to take netns_frags as argument
Followup patch will call it after inet_frag_queue was freed, so q->net
doesn't work anymore (but netf = q->net; free(q); mem_limit(netf) would).

Tested-by: Frank Schreuder <fschreuder@transip.nl>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26 21:00:14 -07:00
Florian Westphal d1fe19444d inet: frag: don't re-use chainlist for evictor
commit 65ba1f1ec0 ("inet: frags: fix a race between inet_evict_bucket
and inet_frag_kill") describes the bug, but the fix doesn't work reliably.

Problem is that ->flags member can be set on other cpu without chainlock
being held by that task, i.e. the RMW-Cycle can clear INET_FRAG_EVICTED
bit after we put the element on the evictor private list.

We can crash when walking the 'private' evictor list since an element can
be deleted from list underneath the evictor.

Join work with Nikolay Alexandrov.

Fixes: b13d3cbfb8 ("inet: frag: move eviction of queues to work queue")
Reported-by: Johan Schuijt <johan@transip.nl>
Tested-by: Frank Schreuder <fschreuder@transip.nl>
Signed-off-by: Nikolay Alexandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26 21:00:14 -07:00
Nicolas Dichtel 2661371ace openvswitch: fix compilation when vxlan is a module
With CONFIG_VXLAN=m and CONFIG_OPENVSWITCH=y, there was the following
compilation error:
  LD      init/built-in.o
  net/built-in.o: In function `vxlan_tnl_create':
  .../net/openvswitch/vport-netdev.c:322: undefined reference to `vxlan_dev_create'
  make: *** [vmlinux] Error 1

CC: Thomas Graf <tgraf@suug.ch>
Fixes: 614732eaa1 ("openvswitch: Use regular VXLAN net_device device")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26 20:56:55 -07:00
Julian Anastasov 88f6432036 ipv4: be more aggressive when probing alternative gateways
Currently, we do not notice if new alternative gateways
are added. We can do it by checking for present neigh
entry. Also, gateways that are currently probed (NUD_INCOMPLETE)
can be skipped from round-robin probing.

Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26 20:56:27 -07:00
Wei-Chun Chao 48fb6b5545 ipv6: fix crash over flow-based vxlan device
Similar check was added in ip_rcv but not in ipv6_rcv.

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff81734e0a>] ipv6_rcv+0xfa/0x500
Call Trace:
[<ffffffff816c9786>] ? ip_rcv+0x296/0x400
[<ffffffff817732d2>] ? packet_rcv+0x52/0x410
[<ffffffff8168e99f>] __netif_receive_skb_core+0x63f/0x9a0
[<ffffffffc02b34a0>] ? br_handle_frame_finish+0x580/0x580 [bridge]
[<ffffffff8109912c>] ? update_rq_clock.part.81+0x1c/0x40
[<ffffffff8168ed18>] __netif_receive_skb+0x18/0x60
[<ffffffff8168fa1f>] process_backlog+0x9f/0x150

Fixes: ee122c79d4 (vxlan: Flow based tunneling)
Signed-off-by: Wei-Chun Chao <weichunc@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26 20:54:56 -07:00
Daniel Borkmann 81296fc673 net: sctp: stop spamming klog with rfc6458, 5.3.2. deprecation warnings
Back then when we added support for SCTP_SNDINFO/SCTP_RCVINFO from
RFC6458 5.3.4/5.3.5, we decided to add a deprecation warning for the
(as per RFC deprecated) SCTP_SNDRCV via commit bbbea41d5e ("net:
sctp: deprecate rfc6458, 5.3.2. SCTP_SNDRCV support"), see [1].

Imho, it was not a good idea, and we should just revert that message
for a couple of reasons:

  1) It's uapi and therefore set in stone forever.

  2) To be able to run on older and newer kernels, an SCTP application
     would need to probe for both, SCTP_SNDRCV, but also SCTP_SNDINFO/
     SCTP_RCVINFO support, so that on older kernels, it can make use
     of SCTP_SNDRCV, and on newer kernels SCTP_SNDINFO/SCTP_RCVINFO.
     In my (limited) experience, a lot of SCTP appliances are migrating
     to newer kernels only ve(ee)ry slowly.

  3) Some people don't have the chance to change their applications,
     f.e. due to proprietary legacy stuff. So, they'll hit this warning
     in fast path and are stuck with older kernels.

But i.e. due to point 1) I really fail to see the benefit of a warning.
So just revert that for now, the issue was reported up Jamal.

  [1] http://thread.gmane.org/gmane.linux.network/321960/

Reported-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Michael Tuexen <tuexen@fh-muenster.de>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26 16:32:41 -07:00
Jon Paul Maloy cda3696d3d tipc: clean up socket layer message reception
When a message is received in a socket, one of the call chains
tipc_sk_rcv()->tipc_sk_enqueue()->filter_rcv()(->tipc_sk_proto_rcv())
or
tipc_sk_backlog_rcv()->filter_rcv()(->tipc_sk_proto_rcv())
are followed. At each of these levels we may encounter situations
where the message may need to be rejected, or a new message
produced for transfer back to the sender. Despite recent
improvements, the current code for doing this is perceived
as awkward and hard to follow.

Leveraging the two previous commits in this series, we now
introduce a more uniform handling of such situations. We
let each of the functions in the chain itself produce/reverse
the message to be returned to the sender, but also perform the
actual forwarding. This simplifies the necessary logics within
each function.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26 16:31:50 -07:00
Jon Paul Maloy bcd3ffd4f6 tipc: introduce new tipc_sk_respond() function
Currently, we use the code sequence

if (msg_reverse())
   tipc_link_xmit_skb()

at numerous locations in socket.c. The preparation of arguments
for these calls, as well as the sequence itself, makes the code
unecessarily complex.

In this commit, we introduce a new function, tipc_sk_respond(),
that performs this call combination. We also replace some, but not
yet all, of these explicit call sequences with calls to the new
function. Notably, we let the function tipc_sk_proto_rcv() use
the new function to directly send out PROBE_REPLY messages,
instead of deferring this to the calling tipc_sk_rcv() function,
as we do now.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26 16:31:50 -07:00
Jon Paul Maloy 29042e19f2 tipc: let function tipc_msg_reverse() expand header when needed
The shortest TIPC message header, for cluster local CONNECTED messages,
is 24 bytes long. With this format, the fields "dest_node" and
"orig_node" are optimized away, since they in reality are redundant
in this particular case.

However, the absence of these fields leads to code inconsistencies
that are difficult to handle in some cases, especially when we need
to reverse or reject messages at the socket layer.

In this commit, we concentrate the handling of the absent fields
to one place, by letting the function tipc_msg_reverse() reallocate
the buffer and expand the header to 32 bytes when necessary. This
means that the socket code now can assume that the two previously
absent fields are present in the header when a message needs to be
rejected. This opens up for some further simplifications of the
socket code.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26 16:31:50 -07:00
Nikolay Aleksandrov 963ad94853 bridge: netlink: fix slave_changelink/br_setport race conditions
Since slave_changelink support was added there have been a few race
conditions when using br_setport() since some of the port functions it
uses require the bridge lock. It is very easy to trigger a lockup due to
some internal spin_lock() usage without bh disabled, also it's possible to
get the bridge into an inconsistent state.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: 3ac636b859 ("bridge: implement rtnl_link_ops->slave_changelink")
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26 16:27:22 -07:00
David S. Miller 485164381c Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains ten Netfilter/IPVS fixes, they are:

1) Address refcount leak when creating an expectation from the ctnetlink
   interface.

2) Fix bug splat in the IDLETIMER target related to sysfs, from Dmitry
   Torokhov.

3) Resolve panic for unreachable route in IPVS with locally generated
   traffic in the output path, from Alex Gartrell.

4) Fix wrong source address in rare cases for tunneled traffic in IPVS,
   from Julian Anastasov.

5) Fix crash if scheduler is changed via ipvsadm -E, again from Julian.

6) Make sure skb->sk is unset for forwarded traffic through IPVS, again from
   Alex Gartrell.

7) Fix crash with IPVS sync protocol v0 and FTP, from Julian.

8) Reset sender cpu for forwarded traffic in IPVS, also from Julian.

9) Allocate template conntracks through kmalloc() to resolve netns dependency
   problems with the conntrack kmem_cache.

10) Fix zones with expectations that clash using the same tuple, from Joe
    Stringer.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-25 00:18:10 -07:00
Konstantin Khlebnikov cc9f4daa63 cgroup: net_cls: fix false-positive "suspicious RCU usage"
In dev_queue_xmit() net_cls protected with rcu-bh.

[  270.730026] ===============================
[  270.730029] [ INFO: suspicious RCU usage. ]
[  270.730033] 4.2.0-rc3+ #2 Not tainted
[  270.730036] -------------------------------
[  270.730040] include/linux/cgroup.h:353 suspicious rcu_dereference_check() usage!
[  270.730041] other info that might help us debug this:
[  270.730043] rcu_scheduler_active = 1, debug_locks = 1
[  270.730045] 2 locks held by dhclient/748:
[  270.730046]  #0:  (rcu_read_lock_bh){......}, at: [<ffffffff81682b70>] __dev_queue_xmit+0x50/0x960
[  270.730085]  #1:  (&qdisc_tx_lock){+.....}, at: [<ffffffff81682d60>] __dev_queue_xmit+0x240/0x960
[  270.730090] stack backtrace:
[  270.730096] CPU: 0 PID: 748 Comm: dhclient Not tainted 4.2.0-rc3+ #2
[  270.730098] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
[  270.730100]  0000000000000001 ffff8800bafeba58 ffffffff817ad487 0000000000000007
[  270.730103]  ffff880232a0a780 ffff8800bafeba88 ffffffff810ca4f2 ffff88022fb23e00
[  270.730105]  ffff880232a0a780 ffff8800bafebb68 ffff8800bafebb68 ffff8800bafebaa8
[  270.730108] Call Trace:
[  270.730121]  [<ffffffff817ad487>] dump_stack+0x4c/0x65
[  270.730148]  [<ffffffff810ca4f2>] lockdep_rcu_suspicious+0xe2/0x120
[  270.730153]  [<ffffffff816a62d2>] task_cls_state+0x92/0xa0
[  270.730158]  [<ffffffffa00b534f>] cls_cgroup_classify+0x4f/0x120 [cls_cgroup]
[  270.730164]  [<ffffffff816aac74>] tc_classify_compat+0x74/0xc0
[  270.730166]  [<ffffffff816ab573>] tc_classify+0x33/0x90
[  270.730170]  [<ffffffffa00bcb0a>] htb_enqueue+0xaa/0x4a0 [sch_htb]
[  270.730172]  [<ffffffff81682e26>] __dev_queue_xmit+0x306/0x960
[  270.730174]  [<ffffffff81682b70>] ? __dev_queue_xmit+0x50/0x960
[  270.730176]  [<ffffffff816834a3>] dev_queue_xmit_sk+0x13/0x20
[  270.730185]  [<ffffffff81787770>] dev_queue_xmit+0x10/0x20
[  270.730187]  [<ffffffff8178b91c>] packet_snd.isra.62+0x54c/0x760
[  270.730190]  [<ffffffff8178be25>] packet_sendmsg+0x2f5/0x3f0
[  270.730203]  [<ffffffff81665245>] ? sock_def_readable+0x5/0x190
[  270.730210]  [<ffffffff817b64bb>] ? _raw_spin_unlock+0x2b/0x40
[  270.730216]  [<ffffffff8173bcbc>] ? unix_dgram_sendmsg+0x5cc/0x640
[  270.730219]  [<ffffffff8165f367>] sock_sendmsg+0x47/0x50
[  270.730221]  [<ffffffff8165f42f>] sock_write_iter+0x7f/0xd0
[  270.730232]  [<ffffffff811fd4c7>] __vfs_write+0xa7/0xf0
[  270.730234]  [<ffffffff811fe5b8>] vfs_write+0xb8/0x190
[  270.730236]  [<ffffffff811fe8c2>] SyS_write+0x52/0xb0
[  270.730239]  [<ffffffff817b6bae>] entry_SYSCALL_64_fastpath+0x12/0x76

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-25 00:13:18 -07:00
WANG Cong 77e62da6e6 sch_choke: drop all packets in queue during reset
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-24 22:57:15 -07:00
WANG Cong fe6bea7f1f sch_plug: purge buffered packets during reset
Otherwise the skbuff related structures are not correctly
refcount'ed.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-24 22:57:14 -07:00
Rosen, Rami 6ca91c6040 bridge: Fix setting a flag in br_fill_ifvlaninfo_range().
This patch fixes setting of vinfo.flags in the br_fill_ifvlaninfo_range() method. The
assignment of vinfo.flags &= ~BRIDGE_VLAN_INFO_RANGE_BEGIN has no effect and is
unneeded, as vinfo.flags value is overriden by the  immediately following
vinfo.flags = flags | BRIDGE_VLAN_INFO_RANGE_END assignement.

Signed-off-by: Rami Rosen <rami.rosen@intel.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-24 22:56:22 -07:00
Julian Anastasov 2392debc2b ipv4: consider TOS in fib_select_default
fib_select_default considers alternative routes only when
res->fi is for the first alias in res->fa_head. In the
common case this can happen only when the initial lookup
matches the first alias with highest TOS value. This
prevents the alternative routes to require specific TOS.

This patch solves the problem as follows:

- routes that require specific TOS should be returned by
fib_select_default only when TOS matches, as already done
in fib_table_lookup. This rule implies that depending on the
TOS we can have many different lists of alternative gateways
and we have to keep the last used gateway (fa_default) in first
alias for the TOS instead of using single tb_default value.

- as the aliases are ordered by many keys (TOS desc,
fib_priority asc), we restrict the possible results to
routes with matching TOS and lowest metric (fib_priority)
and routes that match any TOS, again with lowest metric.

For example, packet with TOS 8 can not use gw3 (not lowest
metric), gw4 (different TOS) and gw6 (not lowest metric),
all other gateways can be used:

tos 8 via gw1 metric 2 <--- res->fa_head and res->fi
tos 8 via gw2 metric 2
tos 8 via gw3 metric 3
tos 4 via gw4
tos 0 via gw5
tos 0 via gw6 metric 1

Reported-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-24 22:46:11 -07:00
Julian Anastasov 18a912e9a8 ipv4: fib_select_default should match the prefix
fib_trie starting from 4.1 can link fib aliases from
different prefixes in same list. Make sure the alternative
gateways are in same table and for same prefix (0) by
checking tb_id and fa_slen.

Fixes: 79e5ad2ceb ("fib_trie: Remove leaf_info")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-24 22:46:09 -07:00
Linus Torvalds d1a343a023 virtio/vhost: fixes for 4.2
Bugfixes and documentation fixes. Igor's patch that allows
 users to tweak memory table size is borderline,
 but it does fix known crashes, so I merged it.
 
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJVpBzpAAoJECgfDbjSjVRpQXEIAMEqetqPuRduynjIw2HNktle
 fe/UUhvipwTsAM4R2pmcEl5tW04A/M54RkXN4iVy0rPshAfG3Fh4XTjLmzSGU0fI
 KD6qX8/Zc/+DUWnfe3aUC6jOOrjb7c4xRKOlQ9X8lZgi2M6AzrPeoZHFTtbX34CU
 2kcnv5Sb1SaI/t2SaCc2CaKilQalEOcd59gGjje2QXjidZnIVHwONrOyjBiINUy6
 GzfTgvAje8ZiG6951W3HDwwSfcqXezin27LJqnZgaLCwTCKt2gdQ2MAKjrfP2aVF
 +gX2B4XxcFLutMVx/obsjvA1ceipubyUauRLB3mnO3P5VOj1qbofa2lj4pzQ80k=
 =EKPr
 -----END PGP SIGNATURE-----

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio/vhost fixes from Michael Tsirkin:
 "Bugfixes and documentation fixes.

  Igor's patch that allows users to tweak memory table size is
  borderline, but it does fix known crashes, so I merged it"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  vhost: add max_mem_regions module parameter
  vhost: extend memory regions allocation to vmalloc
  9p/trans_virtio: reset virtio device on remove
  virtio/s390: rename drivers/s390/kvm -> drivers/s390/virtio
  MAINTAINERS: separate section for s390 virtio drivers
  virtio: define virtio_pci_cfg_cap in header.
  virtio: Fix typecast of pointer in vring_init()
  virtio scsi: fix unused variable warning
  vhost: use binary search instead of linear in find_region()
  virtio_net: document VIRTIO_NET_CTRL_GUEST_OFFLOADS
2015-07-23 13:07:04 -07:00
Jakub Pawlowski 9a0a8a8e85 Bluetooth: Move IRK checking logic in preparation to new connect method
Move IRK checking logic in preparation to new connect method. Also
make sure that MGMT_STATUS_INVALID_PARAMS is returned when non
identity address is passed to ADD_DEVICE. Right now MGMT_STATUS_FAILED
is returned, which might be misleading.

Signed-off-by: Jakub Pawlowski <jpawlowski@google.com>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2015-07-23 17:10:51 +02:00
Dean Jenkins e432c72c46 Bluetooth: __l2cap_wait_ack() add defensive timeout
Add a timeout to prevent the do while loop running in an
infinite loop. This ensures that the channel will be
instructed to close within 10 seconds so prevents
l2cap_sock_shutdown() getting stuck forever.

Returns -ENOLINK when the timeout is reached. The channel
will be subequently closed and not all data will be ACK'ed.

Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-23 17:10:51 +02:00
Dean Jenkins cb02a25583 Bluetooth: __l2cap_wait_ack() use msecs_to_jiffies()
Use msecs_to_jiffies() instead of using HZ so that it
is easier to specify the time in milliseconds.

Also add a #define L2CAP_WAIT_ACK_POLL_PERIOD to specify the 200ms
polling period so that it is defined in a single place.

Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-23 17:10:51 +02:00
Dean Jenkins 451e4c6c6b Bluetooth: Add BT_DBG to l2cap_sock_shutdown()
Add helpful BT_DBG debug to l2cap_sock_shutdown()
and __l2cap_wait_ack() so that the code flow can
be analysed.

Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-23 17:10:51 +02:00
Dean Jenkins f65468f6e2 Bluetooth: Make __l2cap_wait_ack more efficient
Use chan->state instead of chan->conn because waiting
for ACK's is only possible in the BT_CONNECTED state.
Also avoids reference to the conn structure so makes
locking easier.

Only call __l2cap_wait_ack() when the needed condition
of chan->unacked_frames > 0 && chan->state == BT_CONNECTED
is true and convert the while loop to a do while loop.

__l2cap_wait_ack() change the function prototype to
pass in the chan variable as chan is already available
in the calling function l2cap_sock_shutdown(). Avoids
locking issues.

Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-23 17:10:51 +02:00
Dean Jenkins 2baea85dec Bluetooth: L2CAP ERTM shutdown protect sk and chan
During execution of l2cap_sock_shutdown() which might
sleep, the sk and chan structures can be in an unlocked
condition which potentially allows the structures to be
freed by other running threads. Therefore, there is a
possibility of a malfunction or memory reuse after being
freed.

Keep the sk and chan structures alive during the
execution of l2cap_sock_shutdown() by using their
respective hold and put functions. This allows the structures
to be freeable at the end of l2cap_sock_shutdown().

Signed-off-by: Kautuk Consul <Kautuk_Consul@mentor.com>
Signed-off-by: Dean Jenkins <Dean_Jenkins@mentor.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-23 17:10:50 +02:00
Varka Bhadram d10270ce94 mac802154: fix ieee802154_rx handling
Instead of passing ieee802154_hw pointer to ieee802154_rx,
we can directly pass the ieee802154_local pointer.

Signed-off-by: Varka Bhadram <varkabhadram@gmail.com>
Acked-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-23 17:10:50 +02:00
Varka Bhadram 729a8989b3 mac802154: do not export ieee802154_rx()
Right now there are no other users for ieee802154_rx()
in kernel. So lets remove EXPORT_SYMBOL() for this.

Also it moves the function prototype from global header
file to local header file.

Signed-off-by: Varka Bhadram <varkabhadram@gmail.com>
Acked-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-23 17:10:50 +02:00
Alexander Aring 3cf24cf8c3 mac802154: cfg: add suspend and resume callbacks
This patch introduces suspend and resume callbacks to mac802154. When
doing suspend we calling the stop driver callback which should stop the
receiving of frames. A transceiver should go into low-power mode then.
Calling resume will call the start driver callback, which starts receiving
again and allow to transmit frames.

This was tested only with the fakelb driver and a qemu vm by doing the
following commands:

echo "devices" > /sys/power/pm_test
echo "freeze" > /sys/power/state

while doing some high traffic between two fakelb phys.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-23 17:10:49 +02:00
Varka Bhadram a6cb869b3b cfg802154: add PM hooks
This patch help to implement suspend/resume in mac802154, these
hooks will be run before the device is suspended and after it
resumes.

Signed-off-by: Varka Bhadram <varkab@cdac.in>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-23 17:10:49 +02:00
Alexander Aring c4227c8a62 mac802154: util: add stop_device utility function
This patch adds ieee802154_stop_device for preparing a utility function
to stop the ieee802154 device.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-23 17:10:49 +02:00
Varka Bhadram 927e031c7c mac802154: remove unused macro
This patch removes the unused macro which was removed
with the rework of linux-wpan kernel.

Signed-off-by: Varka Bhadram <varkab@cdac.in>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-23 17:10:48 +02:00
Varka Bhadram 8f451829dd mac802154: use WARN_ON() macro
This patch will generate the warning if the required driver ops
were not defined. Also it removes unnecessary debug message.

Signed-off-by: Varka Bhadram <varkab@cdac.in>
Acked-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-23 17:10:48 +02:00
Alexander Aring 301d7d42a2 6lowpan: add request for ipv6 module
The iphc module depends on CONFIG_IPV6, because it's not very useful to
build the module without IPv6 support. Recently an user reported about
issues for setting an IPv6 address to a 6LoWPAN interface. The issues
was solved by modprobe the ipv6 module before. To avoid such user issues
we try to request the ipv6 module when the 6LoWPAN module is loaded.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Acked-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-23 17:10:48 +02:00
Alexander Aring d77b4852b4 mac802154: add llsec address update workaround
This patch adds a workaround for using the new nl802154 netlink
interface with the old ieee802154 netlink interface togehter. The
nl802154 currently supports no access for llsec layer, currently there
are users outside which are using both interfaces at the same time. This
patch adds a necessary call when addresses are updated.

Reported-by: Simon Vincent <simon.vincent@xsilon.com>
Suggested-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2015-07-23 17:10:48 +02:00
Johan Hedberg 25ba265390 Bluetooth: Fix NULL pointer dereference in smp_conn_security
The l2cap_conn->smp pointer may be NULL for various valid reasons where SMP has
failed to initialize properly. One such scenario is when crypto support is
missing, another when the adapter has been powered on through a legacy method.
The smp_conn_security() function should have the appropriate check for this
situation to avoid NULL pointer dereferences.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: stable@vger.kernel.org # 4.0+
2015-07-23 16:41:24 +02:00
Pablo Neira Ayuso 3bbd14e0a2 netfilter: rename local nf_hook_list to hook_list
085db2c045 ("netfilter: Per network namespace netfilter hooks.") introduced a
new nf_hook_list that is global, so let's avoid this overlap.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-23 16:18:35 +02:00
Pablo Neira Ayuso 7181ebafd4 netfilter: fix possible removal of wrong hook
nf_unregister_net_hook() uses the nf_hook_ops fields as tuple to look up for
the corresponding hook in the list. However, we may have two hooks with exactly
the same configuration.

This shouldn't be a problem for nftables since every new chain has an unique
priv field set, but this may still cause us problems in the future, so better
address this problem now by keeping a reference to the original nf_hook_ops
structure to make sure we delete the right hook from nf_unregister_net_hook().

Fixes: 085db2c045 ("netfilter: Per network namespace netfilter hooks.")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-23 16:18:34 +02:00
Pablo Neira Ayuso 2385eb0c5f netfilter: nf_queue: fix nf_queue_nf_hook_drop()
This function reacquires the rtnl_lock() which is already held by
nf_unregister_hook().

This can be triggered via: modprobe nf_conntrack_ipv4 && rmmod nf_conntrack_ipv4

[  720.628746] INFO: task rmmod:3578 blocked for more than 120 seconds.
[  720.628749]       Not tainted 4.2.0-rc2+ #113
[  720.628752] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  720.628754] rmmod           D ffff8800ca46fd58     0  3578   3571 0x00000080
[...]
[  720.628783] Call Trace:
[  720.628790]  [<ffffffff8152ea0b>] schedule+0x6b/0x90
[  720.628795]  [<ffffffff8152ecb3>] schedule_preempt_disabled+0x13/0x20
[  720.628799]  [<ffffffff8152ff55>] mutex_lock_nested+0x1f5/0x380
[  720.628803]  [<ffffffff81462622>] ? rtnl_lock+0x12/0x20
[  720.628807]  [<ffffffff81462622>] ? rtnl_lock+0x12/0x20
[  720.628812]  [<ffffffff81462622>] rtnl_lock+0x12/0x20
[  720.628817]  [<ffffffff8148ab25>] nf_queue_nf_hook_drop+0x15/0x160
[  720.628825]  [<ffffffff81488d48>] nf_unregister_net_hook+0x168/0x190
[  720.628831]  [<ffffffff81488e24>] nf_unregister_hook+0x64/0x80
[  720.628837]  [<ffffffff81488e60>] nf_unregister_hooks+0x20/0x30
[...]

Moreover, nf_unregister_net_hook() should only destroy the queue for this
netns, not for every netns.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Fixes: 085db2c045 ("netfilter: Per network namespace netfilter hooks.")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-23 16:17:58 +02:00
Thomas Graf 045a0fa0c5 ip_tunnel: Call ip_tunnel_core_init() from inet_init()
Convert the module_init() to a invocation from inet_init() since
ip_tunnel_core is part of the INET built-in.

Fixes: 3093fbe7ff ("route: Per route IP tunnel metadata via lightweight tunnel")
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-23 01:28:21 -07:00
David S. Miller c5e40ee287 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/bridge/br_mdb.c

br_mdb.c conflict was a function call being removed to fix a bug in
'net' but whose signature was changed in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-23 00:41:16 -07:00
Linus Torvalds c5dfd654d0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Don't use shared bluetooth antenna in iwlwifi driver for management
    frames, from Emmanuel Grumbach.

 2) Fix device ID check in ath9k driver, from Felix Fietkau.

 3) Off by one in xen-netback BUG checks, from Dan Carpenter.

 4) Fix IFLA_VF_PORT netlink attribute validation, from Daniel Borkmann.

 5) Fix races in setting peeked bit flag in SKBs during datagram
    receive.  If it's shared we have to clone it otherwise the value can
    easily be corrupted.  Fix from Herbert Xu.

 6) Revert fec clock handling change, causes regressions.  From Fabio
    Estevam.

 7) Fix use after free in fq_codel and sfq packet schedulers, from WANG
    Cong.

 8) ipvlan bug fixes (memory leaks, missing rcu_dereference_bh, etc.)
    from WANG Cong and Konstantin Khlebnikov.

 9) Memory leak in act_bpf packet action, from Alexei Starovoitov.

10) ARM bpf JIT bug fixes from Nicolas Schichan.

11) Fix backwards compat of ANY_LAYOUT in virtio_net driver, from
    Michael S Tsirkin.

12) Destruction of bond with different ARP header types not handled
    correctly, fix from Nikolay Aleksandrov.

13) Revert GRO receive support in ipv6 SIT tunnel driver, causes
    regressions because the GRO packets created cannot be processed
    properly on the GSO side if we forward the frame.  From Herbert Xu.

14) TCCR update race and other fixes to ravb driver from Sergei
    Shtylyov.

15) Fix SKB leaks in caif_queue_rcv_skb(), from Eric Dumazet.

16) Fix panics on packet scheduler filter replace, from Daniel Borkmann.

17) Make sure AF_PACKET sees properly IP headers in defragmented frames
    (via PACKET_FANOUT_FLAG_DEFRAG option), from Edward Hyunkoo Jee.

18) AF_NETLINK cannot hold mutex in RCU callback, fix from Florian
    Westphal.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (84 commits)
  ravb: fix ring memory allocation
  net: phy: dp83867: Fix warning check for setting the internal delay
  openvswitch: allocate nr_node_ids flow_stats instead of num_possible_nodes
  netlink: don't hold mutex in rcu callback when releasing mmapd ring
  ARM: net: fix vlan access instructions in ARM JIT.
  ARM: net: handle negative offsets in BPF JIT.
  ARM: net: fix condition for load_order > 0 when translating load instructions.
  tcp: suppress a division by zero warning
  drivers: net: cpsw: remove tx event processing in rx napi poll
  inet: frags: fix defragmented packet's IP header for af_packet
  net: mvneta: fix refilling for Rx DMA buffers
  stmmac: fix setting of driver data in stmmac_dvr_probe
  sched: cls_flow: fix panic on filter replace
  sched: cls_flower: fix panic on filter replace
  sched: cls_bpf: fix panic on filter replace
  net/mdio: fix mdio_bus_match for c45 PHY
  net: ratelimit warnings about dst entry refcount underflow or overflow
  caif: fix leaks and race in caif_queue_rcv_skb()
  qmi_wwan: add the second QMI/network interface for Sierra Wireless MC7305/MC7355
  ravb: fix race updating TCCR
  ...
2015-07-22 14:45:25 -07:00
Trond Myklebust 1980bd4d82 SUNRPC: xprt_complete_bc_request must also decrement the free slot count
Calling xprt_complete_bc_request() effectively causes the slot to be allocated,
so it needs to decrement the backchannel free slot count as well.

Fixes: 0d2a970d0a ("SUNRPC: Fix a backchannel race")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-22 17:10:50 -04:00
Trond Myklebust 68514471ce SUNRPC: Fix a backchannel deadlock
xprt_alloc_bc_request() cannot call xprt_free_bc_request() without
deadlocking, since it already holds the xprt->bc_pa_lock.

Reported-by: Chuck Lever <chuck.lever@oracle.com>
Fixes: 0d2a970d0a ("SUNRPC: Fix a backchannel race")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-22 16:33:59 -04:00
Erik Kline 3985e8a361 ipv6: sysctl to restrict candidate source addresses
Per RFC 6724, section 4, "Candidate Source Addresses":

    It is RECOMMENDED that the candidate source addresses be the set
    of unicast addresses assigned to the interface that will be used
    to send to the destination (the "outgoing" interface).

Add a sysctl to enable this behaviour.

Signed-off-by: Erik Kline <ek@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-22 10:54:11 -07:00
Roopa Prabhu de18547d48 mpls_iptunnel: fix sparse warn: remove incorrect rcu_dereference
fix for:
net/mpls/mpls_iptunnel.c:73:19: sparse: incompatible types in comparison
expression (different address spaces)

remove incorrect rcu_dereference possibly left over from
earlier revisions of the code.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-22 10:50:23 -07:00
Joe Stringer 4b31814d20 netfilter: nf_conntrack: Support expectations in different zones
When zones were originally introduced, the expectation functions were
all extended to perform lookup using the zone. However, insertion was
not modified to check the zone. This means that two expectations which
are intended to apply for different connections that have the same tuple
but exist in different zones cannot both be tracked.

Fixes: 5d0aa2ccd4 (netfilter: nf_conntrack: add support for "conntrack zones")
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-07-22 17:00:47 +02:00
Rick Jones b56ea2985d net: track success and failure of TCP PMTU probing
Track success and failure of TCP PMTU probing.

Signed-off-by: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 22:36:33 -07:00
Roopa Prabhu 01faef2ceb mpls: make RTA_OIF optional
If user did not specify an oif, try and get it from the via address.
If failed to get device, return with -ENODEV.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 22:27:19 -07:00
Chris J Arges bac541e463 openvswitch: allocate nr_node_ids flow_stats instead of num_possible_nodes
Some architectures like POWER can have a NUMA node_possible_map that
contains sparse entries. This causes memory corruption with openvswitch
since it allocates flow_cache with a multiple of num_possible_nodes() and
assumes the node variable returned by for_each_node will index into
flow->stats[node].

Use nr_node_ids to allocate a maximal sparse array instead of
num_possible_nodes().

The crash was noticed after 3af229f2 was applied as it changed the
node_possible_map to match node_online_map on boot.
Fixes: 3af229f207

Signed-off-by: Chris J Arges <chris.j.arges@canonical.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 22:26:03 -07:00
Florian Westphal 0470eb99b4 netlink: don't hold mutex in rcu callback when releasing mmapd ring
Kirill A. Shutemov says:

This simple test-case trigers few locking asserts in kernel:

int main(int argc, char **argv)
{
        unsigned int block_size = 16 * 4096;
        struct nl_mmap_req req = {
                .nm_block_size          = block_size,
                .nm_block_nr            = 64,
                .nm_frame_size          = 16384,
                .nm_frame_nr            = 64 * block_size / 16384,
        };
        unsigned int ring_size;
	int fd;

	fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
        if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0)
                exit(1);
        if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0)
                exit(1);

	ring_size = req.nm_block_nr * req.nm_block_size;
	mmap(NULL, 2 * ring_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	return 0;
}

+++ exited with 0 +++
BUG: sleeping function called from invalid context at /home/kas/git/public/linux-mm/kernel/locking/mutex.c:616
in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: init
3 locks held by init/1:
 #0:  (reboot_mutex){+.+...}, at: [<ffffffff81080959>] SyS_reboot+0xa9/0x220
 #1:  ((reboot_notifier_list).rwsem){.+.+..}, at: [<ffffffff8107f379>] __blocking_notifier_call_chain+0x39/0x70
 #2:  (rcu_callback){......}, at: [<ffffffff810d32e0>] rcu_do_batch.isra.49+0x160/0x10c0
Preemption disabled at:[<ffffffff8145365f>] __delay+0xf/0x20

CPU: 1 PID: 1 Comm: init Not tainted 4.1.0-00009-gbddf4c4818e0 #253
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Debian-1.8.2-1 04/01/2014
 ffff88017b3d8000 ffff88027bc03c38 ffffffff81929ceb 0000000000000102
 0000000000000000 ffff88027bc03c68 ffffffff81085a9d 0000000000000002
 ffffffff81ca2a20 0000000000000268 0000000000000000 ffff88027bc03c98
Call Trace:
 <IRQ>  [<ffffffff81929ceb>] dump_stack+0x4f/0x7b
 [<ffffffff81085a9d>] ___might_sleep+0x16d/0x270
 [<ffffffff81085bed>] __might_sleep+0x4d/0x90
 [<ffffffff8192e96f>] mutex_lock_nested+0x2f/0x430
 [<ffffffff81932fed>] ? _raw_spin_unlock_irqrestore+0x5d/0x80
 [<ffffffff81464143>] ? __this_cpu_preempt_check+0x13/0x20
 [<ffffffff8182fc3d>] netlink_set_ring+0x1ed/0x350
 [<ffffffff8182e000>] ? netlink_undo_bind+0x70/0x70
 [<ffffffff8182fe20>] netlink_sock_destruct+0x80/0x150
 [<ffffffff817e484d>] __sk_free+0x1d/0x160
 [<ffffffff817e49a9>] sk_free+0x19/0x20
[..]

Cong Wang says:

We can't hold mutex lock in a rcu callback, [..]

Thomas Graf says:

The socket should be dead at this point. It might be simpler to
add a netlink_release_ring() function which doesn't require
locking at all.

Reported-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Diagnosed-by: Cong Wang <cwang@twopensource.com>
Suggested-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 22:22:56 -07:00
Eric Dumazet 89e478a2aa tcp: suppress a division by zero warning
Andrew Morton reported following warning on one ARM build
with gcc-4.4 :

net/ipv4/inet_hashtables.c: In function 'inet_ehash_locks_alloc':
net/ipv4/inet_hashtables.c:617: warning: division by zero

Even guarded with a test on sizeof(spinlock_t), compiler does not
like current construct on a !CONFIG_SMP build.

Remove the warning by using a temporary variable.

Fixes: 095dc8e0c3 ("tcp: fix/cleanup inet_ehash_locks_alloc()")
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 22:13:13 -07:00
Jon Paul Maloy 16040894b2 tipc: fix compatibility bug
In commit d999297c3d
("tipc: reduce locking scope during packet reception") we introduced
a new function tipc_link_proto_rcv(). This function contains a bug,
so that it sometimes by error sends out a non-zero link priority value
in created protocol messages.

The bug may lead to an extra link reset at initial link establising
with older nodes. This will never happen more than once, whereafter
the link will work as intended.

We fix this bug in this commit.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 16:23:50 -07:00
Mathias Krause e181a54304 net: #ifdefify sk_classid member of struct sock
The sk_classid member is only required when CONFIG_CGROUP_NET_CLASSID is
enabled. #ifdefify it to reduce the size of struct sock on 32 bit
systems, at least.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 16:04:30 -07:00
Thomas Graf 614732eaa1 openvswitch: Use regular VXLAN net_device device
This gets rid of all OVS specific VXLAN code in the receive and
transmit path by using a VXLAN net_device to represent the vport.
Only a small shim layer remains which takes care of handling the
VXLAN specific OVS Netlink configuration.

Unexports vxlan_sock_add(), vxlan_sock_release(), vxlan_xmit_skb()
since they are no longer needed.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:07 -07:00
Thomas Graf c9db965c52 openvswitch: Abstract vport name through ovs_vport_name()
This allows to get rid of the get_name() vport ops later on.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:07 -07:00
Thomas Graf be4ace6e6b openvswitch: Move dev pointer into vport itself
This is the first step in representing all OVS vports as regular
struct net_devices. Move the net_device pointer into the vport
structure itself to get rid of struct vport_netdev.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:07 -07:00
Thomas Graf 34ae932a40 openvswitch: Make tunnel set action attach a metadata dst
Utilize the new metadata dst to attach encapsulation instructions to
the skb. The existing egress_tun_info via the OVS_CB() is left in
place until all tunnel vports have been converted to the new method.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:06 -07:00
Thomas Graf e7030878fc fib: Add fib rule match on tunnel id
This add the ability to select a routing table based on the tunnel
id which allows to maintain separate routing tables for each virtual
tunnel network.

ip rule add from all tunnel-id 100 lookup 100
ip rule add from all tunnel-id 200 lookup 200

A new static key controls the collection of metadata at tunnel level
upon demand.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:06 -07:00
Thomas Graf 3093fbe7ff route: Per route IP tunnel metadata via lightweight tunnel
This introduces a new IP tunnel lightweight tunnel type which allows
to specify IP tunnel instructions per route. Only IPv4 is supported
at this point.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:06 -07:00
Thomas Graf 1b7179d3ad route: Extend flow representation with tunnel key
Add a new flowi_tunnel structure which is a subset of ip_tunnel_key to
allow routes to match on tunnel metadata. For now, the tunnel id is
added to flowi_tunnel which allows for routes to be bound to specific
virtual tunnels.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:06 -07:00
Thomas Graf 0accfc268f arp: Inherit metadata dst when creating ARP requests
If output device wants to see the dst, inherit the dst of the
original skb and pass it on to generate the ARP request.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:05 -07:00
Thomas Graf f38a9eb1f7 dst: Metadata destinations
Introduces a new dst_metadata which enables to carry per packet metadata
between forwarding and processing elements via the skb->dst pointer.

The structure is set up to be a union. Thus, each separate type of
metadata requires its own dst instance. If demand arises to carry
multiple types of metadata concurrently, metadata dst entries can be
made stackable.

The metadata dst entry is refcnt'ed as expected for now but a non
reference counted use is possible if the reference is forced before
queueing the skb.

In order to allow allocating dsts with variable length, the existing
dst_alloc() is split into a dst_alloc() and dst_init() function. The
existing dst_init() function to initialize the subsystem is being
renamed to dst_subsys_init() to make it clear what is what.

The check before ip_route_input() is changed to ignore metadata dsts
and drop the dst inside the routing function thus allowing to interpret
metadata in a later commit.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:05 -07:00
Thomas Graf 773a69d64b icmp: Don't leak original dst into ip_route_input()
ip_route_input() unconditionally overwrites the dst. Hide the original
dst attached to the skb by calling skb_dst_set(skb, NULL) prior to
ip_route_input().

Reported-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:05 -07:00
Thomas Graf 1d8fff9073 ip_tunnel: Make ovs_tunnel_info and ovs_key_ipv4_tunnel generic
Rename the tunnel metadata data structures currently internal to
OVS and make them generic for use by all IP tunnels.

Both structures are kernel internal and will stay that way. Their
members are exposed to user space through individual Netlink
attributes by OVS. It will therefore be possible to extend/modify
these structures without affecting user ABI.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:05 -07:00
Roopa Prabhu e3e4712ec0 mpls: ip tunnel support
This implementation uses lwtunnel infrastructure to register
hooks for mpls tunnel encaps.

It picks cues from iptunnel_encaps infrastructure and previous
mpls iptunnel RFC patches from Eric W. Biederman and Robert Shearman

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:05 -07:00
Roopa Prabhu face0188e3 mpls: export mpls functions for use by mpls iptunnels
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:04 -07:00
Roopa Prabhu 74a0f2fe8e ipv6: rt6_info output redirect to tunnel output
This is similar to ipv4 redirect of dst output to lwtunnel
output function for encapsulation and xmit.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:04 -07:00
Roopa Prabhu 8602a62502 ipv4: redirect dst output to lwtunnel output
For input routes with tunnel encap state this patch redirects
dst output functions to lwtunnel_output which later resolves to
the corresponding lwtunnel output function.

This has been tested to work with mpls ip tunnels.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:04 -07:00
Roopa Prabhu ffce41962e lwtunnel: support dst output redirect function
This patch introduces lwtunnel_output function to call corresponding
lwtunnels output function to xmit the packet.

It adds two variants lwtunnel_output and lwtunnel_output6 for ipv4 and
ipv6 respectively today. But this is subject to change when lwtstate will
reside in dst or dst_metadata (as per upstream discussions).

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:04 -07:00
Roopa Prabhu 19e42e4515 ipv6: support for fib route lwtunnel encap attributes
This patch adds support in ipv6 fib functions to parse Netlink
RTA encap attributes and attach encap state data to rt6_info.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:04 -07:00
Roopa Prabhu 571e722676 ipv4: support for fib route lwtunnel encap attributes
This patch adds support in ipv4 fib functions to parse user
provided encap attributes and attach encap state data to fib_nh
and rtable.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:03 -07:00
Roopa Prabhu 499a242568 lwtunnel: infrastructure for handling light weight tunnels like mpls
Provides infrastructure to parse/dump/store encap information for
light weight tunnels like mpls. Encap information for such tunnels
is associated with fib routes.

This infrastructure is based on previous suggestions from
Eric Biederman to follow the xfrm infrastructure.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:03 -07:00
Edward Hyunkoo Jee 0848f6428b inet: frags: fix defragmented packet's IP header for af_packet
When ip_frag_queue() computes positions, it assumes that the passed
sk_buff does not contain L2 headers.

However, when PACKET_FANOUT_FLAG_DEFRAG is used, IP reassembly
functions can be called on outgoing packets that contain L2 headers.

Also, IPv4 checksum is not corrected after reassembly.

Fixes: 7736d33f42 ("packet: Add pre-defragmentation support for ipv4 fanouts.")
Signed-off-by: Edward Hyunkoo Jee <edjee@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:29:23 -07:00
Jakub Wilk 0c199a903d xfrm: Fix a typo
Signed-off-by: Jakub Wilk <jwilk@jwilk.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 00:28:36 -07:00
Daniel Borkmann 32b2f4b196 sched: cls_flow: fix panic on filter replace
The following test case causes a NULL pointer dereference in cls_flow:

  tc filter add dev foo parent 1: handle 0x1 flow hash keys dst action ok
  tc filter replace dev foo parent 1: pref 49152 handle 0x1 \
            flow hash keys mark action drop

To be more precise, actually two different panics are fixed, the first
occurs because tcf_exts_init() is not called on the newly allocated
filter when we do a replace. And the second panic uncovered after that
happens since the arguments of list_replace_rcu() are swapped, the old
element needs to be the first argument and the new element the second.

Fixes: 70da9f0bf9 ("net: sched: cls_flow use RCU")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 00:25:03 -07:00
Daniel Borkmann ff3532f265 sched: cls_flower: fix panic on filter replace
The following test case causes a NULL pointer dereference in cls_flower:

  tc filter add dev foo parent 1: flower eth_type ipv4 action ok flowid 1:1
  tc filter replace dev foo parent 1: pref 49152 handle 0x1 \
            flower eth_type ipv6 action ok flowid 1:1

The problem is that commit 77b9900ef5 ("tc: introduce Flower classifier")
accidentally swapped the arguments of list_replace_rcu(), the old
element needs to be the first argument and the new element the second.

Fixes: 77b9900ef5 ("tc: introduce Flower classifier")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 00:25:03 -07:00
Daniel Borkmann f6bfc46da6 sched: cls_bpf: fix panic on filter replace
The following test case causes a NULL pointer dereference in cls_bpf:

  FOO="1,6 0 0 4294967295,"
  tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 action ok
  tc filter replace dev foo parent 1: pref 49152 handle 0x1 \
            bpf bytecode "$FOO" flowid 1:1 action drop

The problem is that commit 1f947bf151 ("net: sched: rcu'ify cls_bpf")
accidentally swapped the arguments of list_replace_rcu(), the old
element needs to be the first argument and the new element the second.

Fixes: 1f947bf151 ("net: sched: rcu'ify cls_bpf")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 00:25:02 -07:00
Marcelo Ricardo Leitner b52effd263 sctp: fix cut and paste issue in comment
Cookie ACK is always received by the association initiator, so fix the
comment to avoid confusion.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 00:21:32 -07:00
Marcelo Ricardo Leitner 0ca50d12fe sctp: fix src address selection if using secondary addresses
In short, sctp is likely to incorrectly choose src address if socket is
bound to secondary addresses. This patch fixes it by adding a new check
that checks if such src address belongs to the interface that routing
identified as output.

This is enough to avoid rp_filter drops on remote peer.

Details:

Currently, sctp will do a routing attempt without specifying the src
address and compare the returned value (preferred source) with the
addresses that the socket is bound to. When using secondary addresses,
this will not match.

Then it will try specifying each of the addresses that the socket is
bound to and re-routing, checking if that address is valid as src for
that dst. Thing is, this check alone is weak:

# ip r l
192.168.100.0/24 dev eth1  proto kernel  scope link  src 192.168.100.149
192.168.122.0/24 dev eth0  proto kernel  scope link  src 192.168.122.147

# ip a l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:15:18:6a brd ff:ff:ff:ff:ff:ff
    inet 192.168.122.147/24 brd 192.168.122.255 scope global dynamic eth0
       valid_lft 2160sec preferred_lft 2160sec
    inet 192.168.122.148/24 scope global secondary eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe15:186a/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:b3:91:46 brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.149/24 brd 192.168.100.255 scope global dynamic eth1
       valid_lft 2162sec preferred_lft 2162sec
    inet 192.168.100.148/24 scope global secondary eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:feb3:9146/64 scope link
       valid_lft forever preferred_lft forever
4: ens9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:05:47:ee brd ff:ff:ff:ff:ff:ff
    inet6 fe80::5054:ff:fe05:47ee/64 scope link
       valid_lft forever preferred_lft forever

# ip r g 192.168.100.193 from 192.168.122.148
192.168.100.193 from 192.168.122.148 dev eth1
    cache

Even if you specify an interface:

# ip r g 192.168.100.193 from 192.168.122.148 oif eth1
192.168.100.193 from 192.168.122.148 dev eth1
    cache

Although this would be valid, peers using rp_filter will drop such
packets as their src doesn't match the routes for that interface.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 00:20:17 -07:00
Marcelo Ricardo Leitner 07868284e5 sctp: reduce indent level on sctp_v4_get_dst
Paves the day for the next patch. Functionality stays untouched.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 00:20:17 -07:00
David S. Miller 27dfead164 Some fixes for the current cycle:
1. Arik introduced an rtnl-locked regulatory API to be able
     to differentiate between place do/don't have the RTNL;
     this fixes missing locking in some of the code paths
 
  2. Two small mesh bugfixes from Bob, one to avoid treating
     a certain malformed over-the-air frame and one to avoid
     sending a garbage field over the air.
 
  3. A fix for powersave during WoWLAN suspend from Krishna Chaitanya.
 
  4. A fix for a powersave vs. aggregation teardown race, from Michal.
 
  5. Thomas reduced the loglevel of CRDA messages to avoid spamming
     the kernel log with mostly irrelevant information.
 
  6. Tom fixed a dangling debugfs directory pointer that could cause
     crashes if subsequent addition of the same interface to debugfs
     failed for some reason.
 
  7. A fix from myself for a list corruption issue in mac80211 during
     combined interface shutdown/removal - shut down interfaces first
     and only then remove them to avoid that.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJVqQMwAAoJEDBSmw7B7bqrZzkQAIjMKojlJRouN/N/aF7ym2pC
 eAboLMC+XHubQoq2H01k5ZOSrLL1kElhkB7pLas+q00gTFyavLzEcEiFqCNuLwPH
 lQEwLXTDUeiaVWekOYJev/ONtaDdwUXoB4BPAA3Ih4EAk9fEBtcWiWeLDgOLOS8P
 eYVqcMV733cOTjhYImEQnhnm3qrcwSCF1vTOJaN4Gf/qqw6j2ilq5wU1TvPyh0TA
 1EP5Elb9hy1sud5X6shrsOBqkBrPoO1p3V4EeoHkxl8welqxXdqGvmA3K0sloGZT
 7RiL8PD4QVyISy1NrBDnNMRRgj6BD1aLC+clmECmmgYvGGcqbzLtB3CWUCV6oQmb
 TC4NmgJkKNVTvQnoqxQEL8JYSs/E2ITRKyMi3sfIYAyz1dVuQf1RkZZzB22rQWT2
 PaLq/k+vpS7E3OD3XB53flB/k7Y6j/OwJb/rE7i2vqSn3kcbua8H7dzd7p+AE5FA
 ZF//u2GBDgZeMaA9BvifByWy2+yvAEcD5/U9XkWqJ7t+HohKteLJj/scHT89pto3
 n0NZ7RVRMNQ9mz14UJiVnFOL/81AjmiU123S5UIIMkmVE5Zrn7VTZlN6fVY4Fcsh
 AtxHQesOlCw8T4lFLxgyKkEl7bxATQ2OMR6vWsZQraRHSqIuK8JDABRokIlzoFn/
 xC/Yn1vTaBuj+2nif/F0
 =US5Y
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2015-07-17' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Some fixes for the current cycle:

 1. Arik introduced an rtnl-locked regulatory API to be able
    to differentiate between place do/don't have the RTNL;
    this fixes missing locking in some of the code paths

 2. Two small mesh bugfixes from Bob, one to avoid treating
    a certain malformed over-the-air frame and one to avoid
    sending a garbage field over the air.

 3. A fix for powersave during WoWLAN suspend from Krishna Chaitanya.

 4. A fix for a powersave vs. aggregation teardown race, from Michal.

 5. Thomas reduced the loglevel of CRDA messages to avoid spamming
    the kernel log with mostly irrelevant information.

 6. Tom fixed a dangling debugfs directory pointer that could cause
    crashes if subsequent addition of the same interface to debugfs
    failed for some reason.

 7. A fix from myself for a list corruption issue in mac80211 during
    combined interface shutdown/removal - shut down interfaces first
    and only then remove them to avoid that.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 00:17:53 -07:00
Konstantin Khlebnikov 8bf4ada2e2 net: ratelimit warnings about dst entry refcount underflow or overflow
Kernel generates a lot of warnings when dst entry reference counter
overflows and becomes negative. That bug was seen several times at
machines with outdated 3.10.y kernels. Most like it's already fixed
in upstream. Anyway that flood completely kills machine and makes
further debugging impossible.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 00:11:19 -07:00
Eric Dumazet b8a23e8d8e caif: fix leaks and race in caif_queue_rcv_skb()
1) If sk_filter() is applied, skb was leaked (not freed)
2) Testing SOCK_DEAD twice is racy :
   packet could be freed while already queued.
3) Remove obsolete comment about caching skb->len

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 00:02:44 -07:00
Alexei Starovoitov 4d9c5c53ac test_bpf: add bpf_skb_vlan_push/pop() tests
improve accuracy of timing in test_bpf and add two stress tests:
- {skb->data[0], get_smp_processor_id} repeated 2k times
- {skb->data[0], vlan_push} x 68 followed by {skb->data[0], vlan_pop} x 68

1st test is useful to test performance of JIT implementation of BPF_LD_ABS
together with BPF_CALL instructions.
2nd test is stressing skb_vlan_push/pop logic together with skb->data access
via BPF_LD_ABS insn which checks that re-caching of skb->data is done correctly.

In order to call bpf_skb_vlan_push() from test_bpf.ko have to add
three export_symbol_gpl.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:52:32 -07:00
Alexei Starovoitov 4e10df9a60 bpf: introduce bpf_skb_vlan_push/pop() helpers
Allow eBPF programs attached to TC qdiscs call skb_vlan_push/pop via
helper functions. These functions may change skb->data/hlen which are
cached by some JITs to improve performance of ld_abs/ld_ind instructions.
Therefore JITs need to recognize bpf_skb_vlan_push/pop() calls,
re-compute header len and re-cache skb->data/hlen back into cpu registers.
Note, skb->data/hlen are not directly accessible from the programs,
so any changes to skb->data done either by these helpers or by other
TC actions are safe.

eBPF JIT supported by three architectures:
- arm64 JIT is using bpf_load_pointer() without caching, so it's ok as-is.
- x64 JIT re-caches skb->data/hlen unconditionally after vlan_push/pop calls
  (experiments showed that conditional re-caching is slower).
- s390 JIT falls back to interpreter for now when bpf_skb_vlan_push() is present
  in the program (re-caching is tbd).

These helpers allow more scalable handling of vlan from the programs.
Instead of creating thousands of vlan netdevs on top of eth0 and attaching
TC+ingress+bpf to all of them, the program can be attached to eth0 directly
and manipulate vlans as necessary.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:52:31 -07:00
Jon Paul Maloy d999297c3d tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:

We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.

Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.

This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:16 -07:00
Jon Paul Maloy 1a20cc254e tipc: introduce node contact FSM
The logics for determining when a node is permitted to establish
and maintain contact with its peer node becomes non-trivial in the
presence of multiple parallel links that may come and go independently.

A known failure scenario is that one endpoint registers both its links
to the peer lost, cleans up it binding table, and prepares for a table
update once contact is re-establihed, while the other endpoint may
see its links reset and re-established one by one, hence seeing
no need to re-synchronize the binding table. To avoid this, a node
must not allow re-establishing contact until it has confirmation that
even the peer has lost both links.

Currently, the mechanism for handling this consists of setting and
resetting two state flags from different locations in the code. This
solution is hard to understand and maintain. A closer analysis even
reveals that it is not completely safe.

In this commit we do instead introduce an FSM that keeps track of
the conditions for when the node can establish and maintain links.
It has six states and four events, and is strictly based on explicit
knowledge about the own node's and the peer node's contact states.
Only events leading to state change are shown as edges in the figure
below.

                             +--------------+
                             | SELF_UP/     |
           +---------------->| PEER_COMING  |-----------------+
    SELF_  |                 +--------------+                 |PEER_
    ESTBL_ |                        |                         |ESTBL_
    CONTACT|      SELF_LOST_CONTACT |                         |CONTACT
           |                        v                         |
           |                 +--------------+                 |
           |      PEER_      | SELF_DOWN/   |     SELF_       |
           |      LOST_   +--| PEER_LEAVING |<--+ LOST_       v
+-------------+   CONTACT |  +--------------+   | CONTACT  +-----------+
| SELF_DOWN/  |<----------+                     +----------| SELF_UP/  |
| PEER_DOWN   |<----------+                     +----------| PEER_UP   |
+-------------+   SELF_   |  +--------------+   | PEER_    +-----------+
           |      LOST_   +--| SELF_LEAVING/|<--+ LOST_       A
           |      CONTACT    | PEER_DOWN    |     CONTACT     |
           |                 +--------------+                 |
           |                         A                        |
    PEER_  |       PEER_LOST_CONTACT |                        |SELF_
    ESTBL_ |                         |                        |ESTBL_
    CONTACT|                 +--------------+                 |CONTACT
           +---------------->| PEER_UP/     |-----------------+
                             | SELF_COMING  |
                             +--------------+

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:16 -07:00
Jon Paul Maloy 8a1577c96f tipc: move link supervision timer to node level
In our effort to move control of the links to the link aggregation
layer, we move the perodic link supervision timer to struct tipc_node.
The new timer is shared between all links belonging to the node, thus
saving resources, while still kicking the FSM on both its pertaining
links at each expiration.

The current link timer and corresponding functions are removed.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:16 -07:00
Jon Paul Maloy 333ef69ed2 tipc: simplify link timer implementation
We create a second, simpler, link timer function, tipc_link_timeout().
The new function  makes use of the new FSM function introduced in the
previous commit, and just like it, takes a buffer queue as parameter.
It returns an event bit field and potentially a link protocol packet
to the caller.

The existing timer function, link_timeout(), is still needed for a
while, so we redesign it to become a wrapper around the new function.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:16 -07:00
Jon Paul Maloy 6ab30f9cbe tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.

In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.

Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.

We also rename the link states as follows:

WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN   -> TIPC_LINK_RESETTING
RESET_RESET     -> TIPC_LINK_ESTABLISHING

The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:15 -07:00
Jon Paul Maloy 426cc2b86d tipc: introduce new link protocol msg create function
As a preparation for later changes, we introduce a new function
tipc_link_build_proto_msg(). Instead of actually sending the created
protocol message, it only creates it and adds it to the head of a
skb queue provided by the caller.

Since we still need the existing function tipc_link_protocol_xmit()
for a while, we redesign it to make use of the new function.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:15 -07:00
Jon Paul Maloy d3504c3449 tipc: clean up definitions and usage of link flags
The status flag LINK_STOPPED is not needed any more, since the
mechanism for delayed deletion of links has been removed.
Likewise, LINK_STARTED and LINK_START_EVT are unnecessary,
because we can just as well start the link timer directly from
inside tipc_link_create().

We eliminate these flags in this commit.

Instead of the above flags, we now introduce three new link modes,
TIPC_LINK_OPEN, TIPC_LINK_BLOCKED and TIPC_LINK_TUNNEL. The values
indicate whether, and in the case of TIPC_LINK_TUNNEL, which, messages
the link is allowed to receive in this state. TIPC_LINK_BLOCKED also
blocks timer-driven protocol messages to be sent out, and any change
to the link FSM. Since the modes are mutually exclusive, we convert
them to state values, and rename the 'flags' field in struct tipc_link
to 'exec_mode'.

Finally, we move the #defines for link FSM states and events from link.h
into enums inside the file link.c, which is the real usage scope of
these definitions.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:15 -07:00
Jon Paul Maloy af9b028e27 tipc: make media xmit call outside node spinlock context
Currently, message sending is performed through a deep call chain,
where the node spinlock is grabbed and held during a significant
part of the transmission time. This is clearly detrimental to
overall throughput performance; it would be better if we could send
the message after the spinlock has been released.

In this commit, we do instead let the call revert on the stack after
the buffer chain has been added to the transmission queue, whereafter
clones of the buffers are transmitted to the device layer outside the
spinlock scope.

As a further step in our effort to separate the roles of the node
and link entities we also move the function tipc_link_xmit() to
node.c, and rename it to tipc_node_xmit().

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:15 -07:00
Jon Paul Maloy 22d85c7942 tipc: change sk_buffer handling in tipc_link_xmit()
When the function tipc_link_xmit() is given a buffer list for
transmission, it currently consumes the list both when transmission
is successful and when it fails, except for the special case when
it encounters link congestion.

This behavior is inconsistent, and needs to be corrected if we want
to avoid problems in later commits in this series.

In this commit, we change this to let the function consume the list
only when transmission is successful, and leave the list with the
sender in all other cases. We also modifiy the socket code so that
it adapts to this change, i.e., purges the list when a non-congestion
error code is returned.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:15 -07:00
Jon Paul Maloy 36e78a463b tipc: use bearer index when looking up active links
struct tipc_node currently holds two arrays of link pointers; one,
indexed by bearer identity, which contains all links irrespective of
current state, and one two-slot array for the currently active link
or links. The latter array contains direct pointers into the elements
of the former. This has the effect that we cannot know the bearer id of
a link when accessing it via the "active_links[]" array without actually
dereferencing the pointer, something we want to avoid in some cases.

In this commit, we do instead store the bearer identity in the
"active_links" array, and use this as an index to find the right element
in the overall link entry array. This change should be seen as a
preparation for the later commits in this series.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:14 -07:00
Jon Paul Maloy d39bbd445d tipc: move link input queue to tipc_node
At present, the link input queue and the name distributor receive
queues are fields aggregated in struct tipc_link. This is a hazard,
because a link might be deleted while a receiving socket still keeps
reference to one of the queues.

This commit fixes this bug. However, rather than adding yet another
reference counter to the critical data path, we move the two queues
to safe ground inside struct tipc_node, which is already protected, and
let the link code only handle references to the queues. This is also
in line with planned later changes in this area.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:14 -07:00
Jon Paul Maloy d3a43b907a tipc: move link creation from neighbor discoverer to node
As a step towards turning links into node internal entities, we move the
creation of links from the neighbor discovery logics to the node's link
control logics.

We also create an additional entry for the link's media address in the
newly introduced struct tipc_link_entry, since this is where it is
needed in the upcoming commits. The current copy in struct tipc_link
is kept for now, but will be removed later.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:14 -07:00
Jon Paul Maloy 9d13ec65ed tipc: introduce link entry structure to struct tipc_node
struct 'tipc_node' currently contains two arrays for link attributes,
one for the link pointers, and one for the usable link MTUs.

We now group those into a new struct 'tipc_link_entry', and intoduce
one single array consisting of such enties. Apart from being a cosmetic
improvement, this is a starting point for the strict master-slave
relation between node and link that we will introduce in the following
commits.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:14 -07:00
Scott Feldman 1a3b2ec93d switchdev: add offload_fwd_mark generator helper
skb->offload_fwd_mark and dev->offload_fwd_mark are 32-bit and should be
unique for device and may even be unique for a sub-set of ports within
device, so add switchdev helper function to generate unique marks based on
port's switch ID and group_ifindex.  group_ifindex would typically be the
container dev's ifindex, such as the bridge's ifindex.

The generator uses a global hash table to store offload_fwd_marks hashed by
{switch ID, group_ifindex} key.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 18:32:44 -07:00
Scott Feldman d754f98b50 net: add phys ID compare helper to test if two IDs are the same
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 18:32:44 -07:00
Scott Feldman 0c4f691ff6 net: don't reforward packets already forwarded by offload device
Just before queuing skb for xmit on port, check if skb has been marked by
switchdev port driver as already fordwarded by device.  If so, drop skb.  A
non-zero skb->offload_fwd_mark field is set by the switchdev port
driver/device on ingress to indicate the skb has already been forwarded by
the device to egress ports with matching dev->skb_mark.  The switchdev port
driver would assign a non-zero dev->offload_skb_mark for each device port
netdev during registration, for example.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 18:32:44 -07:00
Herbert Xu fdbf5b097b Revert "sit: Add gro callbacks to sit_offload"
This patch reverts 19424e052f ("sit:
Add gro callbacks to sit_offload") because it generates packets
that cannot be handled even by our own GSO.

Reported-by: Wolfgang Walter <linux@stwm.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 16:52:28 -07:00
Nikolay Aleksandrov a7ce45a74b bridge: mcast: fix br_multicast_dev_del warn when igmp snooping is not defined
Fix:
net/bridge/br_if.c: In function 'br_dev_delete':
>> net/bridge/br_if.c:284:2: error: implicit declaration of function
>> 'br_multicast_dev_del' [-Werror=implicit-function-declaration]
     br_multicast_dev_del(br);
     ^
   cc1: some warnings being treated as errors

when igmp snooping is not defined.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 16:19:03 -07:00
Phil Sutter a0a9f33bdf net/ipv6: update flowi6_oif in ip6_dst_lookup_flow if not set
Newly created flows don't have flowi6_oif set (at least if the
associated socket is not interface-bound). This leads to a mismatch in
__xfrm6_selector_match() for policies which specify an interface in the
selector (sel->ifindex != 0).

Backtracing shows this happens in code-paths originating from e.g.
ip6_datagram_connect(), rawv6_sendmsg() or tcp_v6_connect(). (UDP was
not tested for.)

In summary, this patch fixes policy matching on outgoing interface for
locally generated packets.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 12:59:32 -07:00
Satish Ashok e10177abf8 bridge: multicast: fix handling of temp and perm entries
When the bridge (or port) is brought down/up flush only temp entries and
leave the perm ones. Flush perm entries only when deleting the bridge
device or the associated port.

Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 12:49:10 -07:00
Nikolay Aleksandrov ef8299de7e bridge: multicast: notify on group delete
Group notifications were not sent when a group expired or was deleted
due to bridge/port device being deleted. So add br_mdb_notify() to
br_multicast_del_pg().

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 12:49:10 -07:00
Daniel Borkmann 8d20aabe1c ebpf: add helper to retrieve net_cls's classid cookie
It would be very useful to retrieve the net_cls's classid from an eBPF
program to allow for a more fine-grained classification, it could be
directly used or in conjunction with additional policies. I.e. docker,
but also tooling such as cgexec, can easily run applications via net_cls
cgroups:

  cgcreate -g net_cls:/foo
  echo 42 > foo/net_cls.classid
  cgexec -g net_cls:foo <prog>

Thus, their respecitve classid cookie of foo can then be looked up on
the egress path to apply further policies. The helper is desigend such
that a non-zero value returns the cgroup id.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 12:41:30 -07:00
Daniel Borkmann b87a173e25 cls_cgroup: factor out classid retrieval
Split out retrieving the cgroups net_cls classid retrieval into its
own function, so that it can be reused later on from other parts of
the traffic control subsystem. If there's no skb->sk, then the small
helper returns 0 as well, which in cls_cgroup terms means 'could not
classify'.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 12:41:30 -07:00
Chuck Lever 31193fe5f6 svcrdma: Remove svc_rdma_fastreg()
Commit 0bf4828983 ("svcrdma: refactor marshalling logic") removed
the last call site for svc_rdma_fastreg().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-07-20 14:58:47 -04:00
Chuck Lever 10dc451218 svcrdma: Clean up svc_rdma_get_reply_array()
Kernel coding conventions frown upon having large nontrivial
functions in header files, and the preference these days is to
allow the compiler to make inlining decisions if possible.

As these functions are re-homed into a .c file, be sure that
comparisons with fields in struct rpcrdma_msg are with be32
constants.

This is a refactoring change; no behavior change is intended.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-07-20 14:58:47 -04:00
Chuck Lever 9d11b51ce7 svcrdma: Fix send_reply() scatter/gather set-up
The Linux NFS server returns garbage in the data payload of inline
NFS/RDMA READ replies. These are READs of under 1000 bytes or so
where the client has not provided either a reply chunk or a write
list.

The NFS server delivers the data payload for an NFS READ reply to
the transport in an xdr_buf page list. If the NFS client did not
provide a reply chunk or a write list, send_reply() is supposed to
set up a separate sge for the page containing the READ data, and
another sge for XDR padding if needed, then post all of the sges via
a single SEND Work Request.

The problem is send_reply() does not advance through the xdr_buf
when setting up scatter/gather entries for SEND WR. It always calls
dma_map_xdr with xdr_off set to zero. When there's more than one
sge, dma_map_xdr() sets up the SEND sge's so they all point to the
xdr_buf's head.

The current Linux NFS/RDMA client always provides a reply chunk or
a write list when performing an NFS READ over RDMA. Therefore, it
does not exercise this particular case. The Linux server has never
had to use more than one extra sge for building RPC/RDMA replies
with a Linux client.

However, an NFS/RDMA client _is_ allowed to send small NFS READs
without setting up a write list or reply chunk. The NFS READ reply
fits entirely within the inline reply buffer in this case. This is
perhaps a more efficient way of performing NFS READs that the Linux
NFS/RDMA client may some day adopt.

Fixes: b432e6b3d9 ('svcrdma: Change DMA mapping logic to . . .')
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=285
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-07-20 14:58:47 -04:00
Shirley Ma ff79c74dca NFS/RDMA Release resources in svcrdma when device is removed
When removing underlying RDMA device, the rmmod will hang forever if there
are any outstanding NFS/RDMA client mounts. The outstanding NFS/RDMA counts
could also prevent the server from shutting down. Further debugging shows
that the existing connections are not teared down and resource are not
released when receiving RDMA_CM_EVENT_DEVICE_REMOVAL event. It seems the
original code missing svc_xprt_put() in RDMA_CM_EVENT_REMOVAL event handler
thus svc_xprt_free is never invoked to release the existing connection
resources.

The patch has been passed removing, adding device back and forth without
stopping NFS/RDMA service. This will also allow a device to be unplugged
and swapped out without shutting down NFS service.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=252
Signed-off-by: Shirley Ma <shirley.ma@oracle.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-07-20 14:58:47 -04:00
Pablo Neira Ayuso b64f48dcda Merge tag 'ipvs-fixes-for-v4.2' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs
Simon Horman says:

====================
IPVS Fixes for v4.2

please consider this fix for v4.2.
For reasons that are not clear to me it is a bumper crop.

It seems to me that they are all relevant to stable.
Please let me know if you need my help to get the fixes into stable.

* ipvs: fix ipv6 route unreach panic

  This problem appears to be present since IPv6 support was added to
  IPVS in v2.6.28.

* ipvs: skb_orphan in case of forwarding

  This appears to resolve a problem resulting from a side effect of
  41063e9dd1 ("ipv4: Early TCP socket demux.") which was included in v3.6.

* ipvs: do not use random local source address for tunnels

  This appears to resolve a problem introduced by
  026ace060d ("ipvs: optimize dst usage for real server") in v3.10.

* ipvs: fix crash if scheduler is changed

  This appears to resolve a problem introduced by
  ceec4c3816 ("ipvs: convert services to rcu") in v3.10.

  Julian has provided backports of the fix:
  * [PATCHv2 3.10.81] ipvs: fix crash if scheduler is changed
    http://www.spinics.net/lists/lvs-devel/msg04008.html
  * [PATCHv2 3.12.44,3.14.45,3.18.16,4.0.6] ipvs: fix crash if scheduler is changed
    http://www.spinics.net/lists/lvs-devel/msg04007.html

  Please let me know how you would like to handle guiding these
  backports into stable.

* ipvs: fix crash with sync protocol v0 and FTP

  This appears to resolve a problem introduced by
  749c42b620 ("ipvs: reduce sync rate with time thresholds") in v3.5
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-07-20 15:01:19 +02:00
Pablo Neira Ayuso 0838aa7fcf netfilter: fix netns dependencies with conntrack templates
Quoting Daniel Borkmann:

"When adding connection tracking template rules to a netns, f.e. to
configure netfilter zones, the kernel will endlessly busy-loop as soon
as we try to delete the given netns in case there's at least one
template present, which is problematic i.e. if there is such bravery that
the priviledged user inside the netns is assumed untrusted.

Minimal example:

  ip netns add foo
  ip netns exec foo iptables -t raw -A PREROUTING -d 1.2.3.4 -j CT --zone 1
  ip netns del foo

What happens is that when nf_ct_iterate_cleanup() is being called from
nf_conntrack_cleanup_net_list() for a provided netns, we always end up
with a net->ct.count > 0 and thus jump back to i_see_dead_people. We
don't get a soft-lockup as we still have a schedule() point, but the
serving CPU spins on 100% from that point onwards.

Since templates are normally allocated with nf_conntrack_alloc(), we
also bump net->ct.count. The issue why they are not yet nf_ct_put() is
because the per netns .exit() handler from x_tables (which would eventually
invoke xt_CT's xt_ct_tg_destroy() that drops reference on info->ct) is
called in the dependency chain at a *later* point in time than the per
netns .exit() handler for the connection tracker.

This is clearly a chicken'n'egg problem: after the connection tracker
.exit() handler, we've teared down all the connection tracking
infrastructure already, so rightfully, xt_ct_tg_destroy() cannot be
invoked at a later point in time during the netns cleanup, as that would
lead to a use-after-free. At the same time, we cannot make x_tables depend
on the connection tracker module, so that the xt_ct_tg_destroy() would
be invoked earlier in the cleanup chain."

Daniel confirms this has to do with the order in which modules are loaded or
having compiled nf_conntrack as modules while x_tables built-in. So we have no
guarantees regarding the order in which netns callbacks are executed.

Fix this by allocating the templates through kmalloc() from the respective
SYNPROXY and CT targets, so they don't depend on the conntrack kmem cache.
Then, release then via nf_ct_tmpl_free() from destroy_conntrack(). This branch
is marked as unlikely since conntrack templates are rarely allocated and only
from the configuration plane path.

Note that templates are not kept in any list to avoid further dependencies with
nf_conntrack anymore, thus, the tmpl larval list is removed.

Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Daniel Borkmann <daniel@iogearbox.net>
2015-07-20 14:58:19 +02:00
Eric W. Biederman e317fa505d netfilter: Fix memory leak in nf_register_net_hook
In the rare case that when it is a attempted to use a per network device
netfilter hook and the network device does not exist the newly allocated
structure can leak.

Be a good citizen and free the newly allocated structure in the error
handling code.

Fixes: 085db2c045 ("netfilter: Per network namespace netfilter hooks.")
Reported-by: kbuild@01.org
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-07-20 09:15:50 +02:00