Commit Graph

40628 Commits

Author SHA1 Message Date
Johannes Weiner baac50bbc3 net: tcp_memcontrol: simplify linkage between socket and page counter
There won't be any separate counters for socket memory consumed by
protocols other than TCP in the future.  Remove the indirection and link
sockets directly to their owning memory cgroup.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Johannes Weiner e805605c72 net: tcp_memcontrol: sanitize tcp memory accounting callbacks
There won't be a tcp control soft limit, so integrating the memcg code
into the global skmem limiting scheme complicates things unnecessarily.
Replace this with simple and clear charge and uncharge calls--hidden
behind a jump label--to account skb memory.

Note that this is not purely aesthetic: as a result of shoehorning the
per-memcg code into the same memory accounting functions that handle the
global level, the old code would compare the per-memcg consumption
against the smaller of the per-memcg limit and the global limit.  This
allowed the total consumption of multiple sockets to exceed the global
limit, as long as the individual sockets stayed within bounds.  After
this change, the code will always compare the per-memcg consumption to
the per-memcg limit, and the global consumption to the global limit, and
thus close this loophole.

Without a soft limit, the per-memcg memory pressure state in sockets is
generally questionable.  However, we did it until now, so we continue to
enter it when the hard limit is hit, and packets are dropped, to let
other sockets in the cgroup know that they shouldn't grow their transmit
windows, either.  However, keep it simple in the new callback model and
leave memory pressure lazily when the next packet is accepted (as
opposed to doing it synchroneously when packets are processed).  When
packets are dropped, network performance will already be in the toilet,
so that should be a reasonable trade-off.

As described above, consumption is now checked on the per-memcg level
and the global level separately.  Likewise, memory pressure states are
maintained on both the per-memcg level and the global level, and a
socket is considered under pressure when either level asserts as much.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Johannes Weiner 80f23124f5 net: tcp_memcontrol: simplify the per-memcg limit access
tcp_memcontrol replicates the global sysctl_mem limit array per cgroup,
but it only ever sets these entries to the value of the memory_allocated
page_counter limit.  Use the latter directly.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Johannes Weiner af95d7df40 net: tcp_memcontrol: remove dead per-memcg count of allocated sockets
The number of allocated sockets is used for calculations in the soft
limit phase, where packets are accepted but the socket is under memory
pressure.
 Since there is no soft limit phase in tcp_memcontrol, and memory
pressure is only entered when packets are already dropped, this is
actually dead code.  Remove it.

As this is the last user of parent_cg_proto(), remove that too.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Johannes Weiner 3d596f7b90 net: tcp_memcontrol: protect all tcp_memcontrol calls by jump-label
Move the jump-label from sock_update_memcg() and sock_release_memcg() to
the callsite, and so eliminate those function calls when socket
accounting is not enabled.

This also eliminates the need for dummy functions because the calls will
be optimized away if the Kconfig options are not enabled.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Vladimir Davydov 9ee11ba425 memcg: do not allow to disable tcp accounting after limit is set
There are two bits defined for cg_proto->flags - MEMCG_SOCK_ACTIVATED
and MEMCG_SOCK_ACTIVE - both are set in tcp_update_limit, but the former
is never cleared while the latter can be cleared by unsetting the limit.
This allows to disable tcp socket accounting for new sockets after it
was enabled by writing -1 to memory.kmem.tcp.limit_in_bytes while still
guaranteeing that memcg_socket_limit_enabled static key will be
decremented on memcg destruction.

This functionality looks dubious, because it is not clear what a use
case would be.  By enabling tcp accounting a user accepts the price.  If
they then find the performance degradation unacceptable, they can always
restart their workload with tcp accounting disabled.  It does not seem
there is any need to flip it while the workload is running.

Besides, it contradicts to how kmem accounting API works: writing
whatever to memory.kmem.limit_in_bytes enables kmem accounting for the
cgroup in question, after which it cannot be disabled.  Therefore one
might expect that writing -1 to memory.kmem.tcp.limit_in_bytes just
enables socket accounting w/o limiting it, which might be useful by
itself, but it isn't true.

Since this API peculiarity is not documented anywhere, I propose to drop
it.  This will allow to simplify the code by dropping cg_proto->flags.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Vladimir Davydov 5d097056c9 kmemcg: account certain kmem allocations to memcg
Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg.  For the list, see below:

 - threadinfo
 - task_struct
 - task_delay_info
 - pid
 - cred
 - mm_struct
 - vm_area_struct and vm_region (nommu)
 - anon_vma and anon_vma_chain
 - signal_struct
 - sighand_struct
 - fs_struct
 - files_struct
 - fdtable and fdtable->full_fds_bits
 - dentry and external_name
 - inode for all filesystems. This is the most tedious part, because
   most filesystems overwrite the alloc_inode method.

The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds.  Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
David S. Miller 1adbfc435f Included bugfixes:
- avoid freeing batadv_hardif_neigh_node when still in use in other contexts
 - prevent lockdep splat in mcast_free during shutdown
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJWlovIAAoJENpFlCjNi1MRxccP/1+pDAPE9GpSUwOaDUYjWVYQ
 4k1jRdJZAiMTrLyr1pW3Gesq3wTwtcIjjevaZGOa3QyhKRPmVj7yHqeYv62+WX9x
 vEu6Mq6btBUqF+7+hCA/7n+Wyca+fsg9ge87hi6blmTSexlrcog80vTf2/N0CAQ4
 w9jVUWcWpK8fZ6ia5IiK/RNw9gL4onh98V/7WqcvUyOmBmlyQJTpdqvvj5DZGFbL
 u6B5hr0hzLirW2LJsk4fdYgKzyMYXBPtLZoiRiuk9GyjMDK+t0rUPdpuoPYhiVhH
 ZFidpICvNpBDQyDjITGX6XvAUqYI4bjDfxT3vdznkNhcmcPjrcITSw8x8u2rB3Xr
 TTVPtxi1OvGc0UCW+2Y47IJuHowZ7dKI0B2PiHdvLAlgM4dMYupZlruRMhR/XF0b
 x7wSsbETxO3g+SH3qqJVMyebrrX//uC19vThbq+gVzMEa1hkZSSvDZbp/c2RDlCA
 h5DtOjYPzCQeFo/8eorprUaO0zpbF0k/fTQu8f9csyozKVMdcCTO2X0vSRoXJG+m
 9EP4WiFhqRXZWlgrTAXVxTgPlP1KEDHTmjynC2iwkg1Fxje1xBWX+zdWAt9Vgrby
 HhYlb2wWg6tTt8zHjqwtOUqeCMDl8/7I+nk4GjJMzlNv3Iv2+3Eyj+DALOiQgt1S
 xqvE8KkN7EEAzlzkqDPF
 =6uBh
 -----END PGP SIGNATURE-----

Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge

Antonio Quartulli says:

====================
net: batman-adv 20160114

Included bugfixes:
- avoid freeing batadv_hardif_neigh_node when still in use in other contexts
- prevent lockdep splat in mcast_free during shutdown
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-13 14:57:42 -05:00
David S. Miller b8e429a2fe genetlink: Fix off-by-one in genl_allocate_reserve_groups()
The bug fix for adding n_groups to the computation forgot
to adjust ">=" to ">" to keep the condition correct.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-13 10:28:06 -05:00
Sven Eckelmann bab7c6c3de batman-adv: Fix list removal of batadv_hardif_neigh_node
The neigh_list with batadv_hardif_neigh_node objects is accessed with only
rcu_read_lock in batadv_hardif_neigh_get and batadv_iv_neigh_print. Thus it
is not allowed to kfree the object before the rcu grace period ends (which
may still protects context accessing this object). Therefore the object has
first to be removed from the neigh_list and then it has either wait with
synchronize_rcu or call_rcu till the grace period ends before it can be
freed.

Fixes: cef63419f7 ("batman-adv: add list of unique single hop neighbors per hard-interface")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-13 19:28:27 +08:00
Simon Wunderlich af63cf51b7 batman-adv: fix lockdep splat when doing mcast_free
While testing, we got something like this:

WARNING: CPU: 0 PID: 238 at net/batman-adv/multicast.c:142 batadv_mcast_mla_tt_retract+0x94/0x205 [batman_adv]()
[...]
Call Trace:
[<ffffffff815fc597>] dump_stack+0x4b/0x64
[<ffffffff810b34dc>] warn_slowpath_common+0xbc/0x120
[<ffffffffa0024ec5>] ? batadv_mcast_mla_tt_retract+0x94/0x205 [batman_adv]
[<ffffffff810b3705>] warn_slowpath_null+0x15/0x20
[<ffffffffa0024ec5>] batadv_mcast_mla_tt_retract+0x94/0x205 [batman_adv]
[<ffffffffa00273fe>] batadv_mcast_free+0x36/0x39 [batman_adv]
[<ffffffffa0020c77>] batadv_mesh_free+0x7d/0x13f [batman_adv]
[<ffffffffa0036a6b>] batadv_softif_free+0x15/0x25 [batman_adv]
[...]

Signed-off-by: Simon Wunderlich <simon@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-13 19:26:25 +08:00
David S. Miller ddb5388ffd Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux 2016-01-13 00:21:27 -05:00
Linus Torvalds aee3bfa330 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from Davic Miller:

 1) Support busy polling generically, for all NAPI drivers.  From Eric
    Dumazet.

 2) Add byte/packet counter support to nft_ct, from Floriani Westphal.

 3) Add RSS/XPS support to mvneta driver, from Gregory Clement.

 4) Implement IPV6_HDRINCL socket option for raw sockets, from Hannes
    Frederic Sowa.

 5) Add support for T6 adapter to cxgb4 driver, from Hariprasad Shenai.

 6) Add support for VLAN device bridging to mlxsw switch driver, from
    Ido Schimmel.

 7) Add driver for Netronome NFP4000/NFP6000, from Jakub Kicinski.

 8) Provide hwmon interface to mlxsw switch driver, from Jiri Pirko.

 9) Reorganize wireless drivers into per-vendor directories just like we
    do for ethernet drivers.  From Kalle Valo.

10) Provide a way for administrators "destroy" connected sockets via the
    SOCK_DESTROY socket netlink diag operation.  From Lorenzo Colitti.

11) Add support to add/remove multicast routes via netlink, from Nikolay
    Aleksandrov.

12) Make TCP keepalive settings per-namespace, from Nikolay Borisov.

13) Add forwarding and packet duplication facilities to nf_tables, from
    Pablo Neira Ayuso.

14) Dead route support in MPLS, from Roopa Prabhu.

15) TSO support for thunderx chips, from Sunil Goutham.

16) Add driver for IBM's System i/p VNIC protocol, from Thomas Falcon.

17) Rationalize, consolidate, and more completely document the checksum
    offloading facilities in the networking stack.  From Tom Herbert.

18) Support aborting an ongoing scan in mac80211/cfg80211, from
    Vidyullatha Kanchanapally.

19) Use per-bucket spinlock for bpf hash facility, from Tom Leiming.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1375 commits)
  net: bnxt: always return values from _bnxt_get_max_rings
  net: bpf: reject invalid shifts
  phonet: properly unshare skbs in phonet_rcv()
  dwc_eth_qos: Fix dma address for multi-fragment skbs
  phy: remove an unneeded condition
  mdio: remove an unneed condition
  mdio_bus: NULL dereference on allocation error
  net: Fix typo in netdev_intersect_features
  net: freescale: mac-fec: Fix build error from phy_device API change
  net: freescale: ucc_geth: Fix build error from phy_device API change
  bonding: Prevent IPv6 link local address on enslaved devices
  IB/mlx5: Add flow steering support
  net/mlx5_core: Export flow steering API
  net/mlx5_core: Make ipv4/ipv6 location more clear
  net/mlx5_core: Enable flow steering support for the IB driver
  net/mlx5_core: Initialize namespaces only when supported by device
  net/mlx5_core: Set priority attributes
  net/mlx5_core: Connect flow tables
  net/mlx5_core: Introduce modify flow table command
  net/mlx5_core: Managing root flow table
  ...
2016-01-12 18:57:02 -08:00
Linus Torvalds 33caf82acf Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "All kinds of stuff.  That probably should've been 5 or 6 separate
  branches, but by the time I'd realized how large and mixed that bag
  had become it had been too close to -final to play with rebasing.

  Some fs/namei.c cleanups there, memdup_user_nul() introduction and
  switching open-coded instances, burying long-dead code, whack-a-mole
  of various kinds, several new helpers for ->llseek(), assorted
  cleanups and fixes from various people, etc.

  One piece probably deserves special mention - Neil's
  lookup_one_len_unlocked().  Similar to lookup_one_len(), but gets
  called without ->i_mutex and tries to avoid ever taking it.  That, of
  course, means that it's not useful for any directory modifications,
  but things like getting inode attributes in nfds readdirplus are fine
  with that.  I really should've asked for moratorium on lookup-related
  changes this cycle, but since I hadn't done that early enough...  I
  *am* asking for that for the coming cycle, though - I'm going to try
  and get conversion of i_mutex to rwsem with ->lookup() done under lock
  taken shared.

  There will be a patch closer to the end of the window, along the lines
  of the one Linus had posted last May - mechanical conversion of
  ->i_mutex accesses to inode_lock()/inode_unlock()/inode_trylock()/
  inode_is_locked()/inode_lock_nested().  To quote Linus back then:

    -----
    |    This is an automated patch using
    |
    |        sed 's/mutex_lock(&\(.*\)->i_mutex)/inode_lock(\1)/'
    |        sed 's/mutex_unlock(&\(.*\)->i_mutex)/inode_unlock(\1)/'
    |        sed 's/mutex_lock_nested(&\(.*\)->i_mutex,[     ]*I_MUTEX_\([A-Z0-9_]*\))/inode_lock_nested(\1, I_MUTEX_\2)/'
    |        sed 's/mutex_is_locked(&\(.*\)->i_mutex)/inode_is_locked(\1)/'
    |        sed 's/mutex_trylock(&\(.*\)->i_mutex)/inode_trylock(\1)/'
    |
    |    with a very few manual fixups
    -----

  I'm going to send that once the ->i_mutex-affecting stuff in -next
  gets mostly merged (or when Linus says he's about to stop taking
  merges)"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  nfsd: don't hold i_mutex over userspace upcalls
  fs:affs:Replace time_t with time64_t
  fs/9p: use fscache mutex rather than spinlock
  proc: add a reschedule point in proc_readfd_common()
  logfs: constify logfs_block_ops structures
  fcntl: allow to set O_DIRECT flag on pipe
  fs: __generic_file_splice_read retry lookup on AOP_TRUNCATED_PAGE
  fs: xattr: Use kvfree()
  [s390] page_to_phys() always returns a multiple of PAGE_SIZE
  nbd: use ->compat_ioctl()
  fs: use block_device name vsprintf helper
  lib/vsprintf: add %*pg format specifier
  fs: use gendisk->disk_name where possible
  poll: plug an unused argument to do_poll
  amdkfd: don't open-code memdup_user()
  cdrom: don't open-code memdup_user()
  rsxx: don't open-code memdup_user()
  mtip32xx: don't open-code memdup_user()
  [um] mconsole: don't open-code memdup_user_nul()
  [um] hostaudio: don't open-code memdup_user()
  ...
2016-01-12 17:11:47 -08:00
Rabin Vincent 229394e8e6 net: bpf: reject invalid shifts
On ARM64, a BUG() is triggered in the eBPF JIT if a filter with a
constant shift that can't be encoded in the immediate field of the
UBFM/SBFM instructions is passed to the JIT.  Since these shifts
amounts, which are negative or >= regsize, are invalid, reject them in
the eBPF verifier and the classic BPF filter checker, for all
architectures.

Signed-off-by: Rabin Vincent <rabin@rab.in>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-12 17:06:53 -05:00
Matti Vaittinen ccdf6ce6a8 net: netlink: Fix multicast group storage allocation for families with more than one groups
Multicast groups are stored in global buffer. Check for needed buffer size
incorrectly compares buffer size to first id for family. This means that
for families with more than one mcast id one may allocate too small buffer
and end up writing rest of the groups to some unallocated memory. Fix the
buffer size check to compare allocated space to last mcast id for the
family.

Tested on ARM using kernel 3.14

Signed-off-by: Matti Vaittinen <matti.vaittinen@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-12 16:40:15 -05:00
Rabin Vincent 06928b3870 net: bpf: reject invalid shifts
On ARM64, a BUG() is triggered in the eBPF JIT if a filter with a
constant shift that can't be encoded in the immediate field of the
UBFM/SBFM instructions is passed to the JIT.  Since these shifts
amounts, which are negative or >= regsize, are invalid, reject them in
the eBPF verifier and the classic BPF filter checker, for all
architectures.

Signed-off-by: Rabin Vincent <rabin@rab.in>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-12 15:55:39 -05:00
Eric Dumazet 7aaed57c5c phonet: properly unshare skbs in phonet_rcv()
Ivaylo Dimitrov reported a regression caused by commit 7866a62104
("dev: add per net_device packet type chains").

skb->dev becomes NULL and we crash in __netif_receive_skb_core().

Before above commit, different kind of bugs or corruptions could happen
without major crash.

But the root cause is that phonet_rcv() can queue skb without checking
if skb is shared or not.

Many thanks to Ivaylo Dimitrov for his help, diagnosis and tests.

Reported-by: Ivaylo Dimitrov <ivo.g.dimitrov.75@gmail.com>
Tested-by: Ivaylo Dimitrov <ivo.g.dimitrov.75@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Remi Denis-Courmont <courmisch@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-12 12:05:38 -05:00
David S. Miller 9d367eddf3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/bonding/bond_main.c
	drivers/net/ethernet/mellanox/mlxsw/spectrum.h
	drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c

The bond_main.c and mellanox switch conflicts were cases of
overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-11 23:55:43 -05:00
Michal Kubeček 40ba330227 udp: disallow UFO for sockets with SO_NO_CHECK option
Commit acf8dd0a9d ("udp: only allow UFO for packets from SOCK_DGRAM
sockets") disallows UFO for packets sent from raw sockets. We need to do
the same also for SOCK_DGRAM sockets with SO_NO_CHECK options, even if
for a bit different reason: while such socket would override the
CHECKSUM_PARTIAL set by ip_ufo_append_data(), gso_size is still set and
bad offloading flags warning is triggered in __skb_gso_segment().

In the IPv6 case, SO_NO_CHECK option is ignored but we need to disallow
UFO for packets sent by sockets with UDP_NO_CHECK6_TX option.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Tested-by: Shannon Nelson <shannon.nelson@intel.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-11 17:40:57 -05:00
John Fastabend 3de03596df net: pktgen: fix null ptr deref in skb allocation
Fix possible null pointer dereference that may occur when calling
skb_reserve() on a null skb.

Fixes: 879c7220e8 ("net: pktgen: Observe needed_headroom of the device")
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-11 17:39:44 -05:00
Daniel Borkmann c6c3345407 bpf: support ipv6 for bpf_skb_{set,get}_tunnel_key
After IPv6 support has recently been added to metadata dst and related
encaps, add support for populating/reading it from an eBPF program.

Commit d3aa45ce6b ("bpf: add helpers to access tunnel metadata") started
with initial IPv4-only support back then (due to IPv6 metadata support
not being available yet).

To stay compatible with older programs, we need to test for the passed
structure size. Also TOS and TTL support from the ip_tunnel_info key has
been added. Tested with vxlan devs in collect meta data mode with IPv4,
IPv6 and in compat mode over different network namespaces.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-11 17:32:55 -05:00
Daniel Borkmann 781c53bc5d bpf: export helper function flags and reject invalid ones
Export flags used by eBPF helper functions through UAPI, so they can be
used by programs (instead of them redefining all flags each time or just
using the hard-coded values). It also gives a better overview what flags
are used where and we can further get rid of the extra macros defined in
filter.c. Moreover, reject invalid flags.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-11 17:32:55 -05:00
Jamal Hadi Salim 66530bdf85 sched,cls_flower: set key address type when present
only when user space passes the addresses should we consider their
presence

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-11 17:27:30 -05:00
Neal Cardwell 83d15e70c4 tcp_yeah: don't set ssthresh below 2
For tcp_yeah, use an ssthresh floor of 2, the same floor used by Reno
and CUBIC, per RFC 5681 (equation 4).

tcp_yeah_ssthresh() was sometimes returning a 0 or negative ssthresh
value if the intended reduction is as big or bigger than the current
cwnd. Congestion control modules should never return a zero or
negative ssthresh. A zero ssthresh generally results in a zero cwnd,
causing the connection to stall. A negative ssthresh value will be
interpreted as a u32 and will set a target cwnd for PRR near 4
billion.

Oleksandr Natalenko reported that a system using tcp_yeah with ECN
could see a warning about a prior_cwnd of 0 in
tcp_cwnd_reduction(). Testing verified that this was due to
tcp_yeah_ssthresh() misbehaving in this way.

Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-11 17:25:16 -05:00
Marcelo Ricardo Leitner 649621e3d5 sctp: fix use-after-free in pr_debug statement
Dmitry Vyukov reported a use-after-free in the code expanded by the
macro debug_post_sfx, which is caused by the use of the asoc pointer
after it was freed within sctp_side_effect() scope.

This patch fixes it by allowing sctp_side_effect to clear that asoc
pointer when the TCB is freed.

As Vlad explained, we also have to cover the SCTP_DISPOSITION_ABORT case
because it will trigger DELETE_TCB too on that same loop.

Also, there were places issuing SCTP_CMD_INIT_FAILED and ASSOC_FAILED
but returning SCTP_DISPOSITION_CONSUME, which would fool the scheme
above. Fix it by returning SCTP_DISPOSITION_ABORT instead.

The macro is already prepared to handle such NULL pointer.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-11 17:13:01 -05:00
Alexander Kuleshov 617cfc7530 net/rtnetlink: remove unused sz_idx variable
The sz_idx variable is defined in the rtnetlink_rcv_msg(), but
not used anywhere. Let's remove it.

Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-11 00:22:20 -05:00
willy tarreau 712f4aad40 unix: properly account for FDs passed over unix sockets
It is possible for a process to allocate and accumulate far more FDs than
the process' limit by sending them over a unix socket then closing them
to keep the process' fd count low.

This change addresses this problem by keeping track of the number of FDs
in flight per user and preventing non-privileged processes from having
more FDs in flight than their configured FD limit.

Reported-by: socketpair@gmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Mitigates: CVE-2013-4312 (Linux 2.0+)
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-11 00:05:30 -05:00
Jean Sacren c5420eb12f openvswitch: update kernel doc for struct vport
commit be4ace6e6b ("openvswitch: Move dev pointer into vport itself")

The commit above added @dev and moved @rcu to the bottom of struct
vport, but the change was not reflected in the kernel doc. So let's
update the kernel doc as well.

Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 23:49:21 -05:00
Jean Sacren 2f7066ada1 openvswitch: fix struct geneve_port member name
commit 6b001e682e ("openvswitch: Use Geneve device.")

The commit above introduced 'port_no' as the name for the member of
struct geneve_port. The correct name should be 'dst_port' as described
in the kernel doc. Let's fix that member name and all the pertinent
instances so that both doc and code would be consistent.

Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 23:49:21 -05:00
Jean Sacren 5ea030429f openvswitch: clean up unused function
commit 6b001e682e ("openvswitch: Use Geneve device.")

The commit above deleted the only call site of ovs_tunnel_route_lookup()
and now that function is not used any more. So let's delete the function
definition as well.

Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 23:49:21 -05:00
Eric Dumazet 3e4006f0b8 ipv6: tcp: add rcu locking in tcp_v6_send_synack()
When first SYNACK is sent, we already hold rcu_read_lock(), but this
is not true if a SYNACK is retransmitted, as a timer (soft) interrupt
does not hold rcu_read_lock()

Fixes: 45f6fad84c ("ipv6: add complete rcu protection around np->opt")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 22:58:03 -05:00
Eric Dumazet a78cb84c62 net: add scheduling point in recvmmsg/sendmmsg
Applications often have to reduce number of datagrams
they receive or send per system call to avoid starvation problems.

Really the kernel should take care of this by using cond_resched(),
so that applications can experiment bigger batch sizes.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 22:56:29 -05:00
Lubomir Rintel 3d171f3907 ipv6: always add flag an address that failed DAD with DADFAILED
The userspace needs to know why is the address being removed so that it can
perhaps obtain a new address.

Without the DADFAILED flag it's impossible to distinguish removal of a
temporary and tentative address due to DAD failure from other reasons (device
removed, manual address removal).

Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 22:54:27 -05:00
Daniel Borkmann 1f211a1b92 net, sched: add clsact qdisc
This work adds a generalization of the ingress qdisc as a qdisc holding
only classifiers. The clsact qdisc works on ingress, but also on egress.
In both cases, it's execution happens without taking the qdisc lock, and
the main difference for the egress part compared to prior version of [1]
is that this can be applied with _any_ underlying real egress qdisc (also
classless ones).

Besides solving the use-case of [1], that is, allowing for more programmability
on assigning skb->priority for the mqprio case that is supported by most
popular 10G+ NICs, it also opens up a lot more flexibility for other tc
applications. The main work on classification can already be done at clsact
egress time if the use-case allows and state stored for later retrieval
f.e. again in skb->priority with major/minors (which is checked by most
classful qdiscs before consulting tc_classify()) and/or in other skb fields
like skb->tc_index for some light-weight post-processing to get to the
eventual classid in case of a classful qdisc. Another use case is that
the clsact egress part allows to have a central egress counterpart to
the ingress classifiers, so that classifiers can easily share state (e.g.
in cls_bpf via eBPF maps) for ingress and egress.

Currently, default setups like mq + pfifo_fast would require for this to
use, for example, prio qdisc instead (to get a tc_classify() run) and to
duplicate the egress classifier for each queue. With clsact, it allows
for leaving the setup as is, it can additionally assign skb->priority to
put the skb in one of pfifo_fast's bands and it can share state with maps.
Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
w/o the need to perform a skb_dst_force() to hold on to it any longer. In
lwt case, we can also use this facility to setup dst metadata via cls_bpf
(bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
that (case of IFF_NO_QUEUE devices, for example).

The realization can be done without any changes to the scheduler core
framework. All it takes is that we have two a-priori defined minors/child
classes, where we can mux between ingress and egress classifier list
(dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
dev->_tx to avoid extra cacheline miss for moderate loads). The egress
part is a bit similar modelled to handle_ing() and patched to a noop in
case the functionality is not used. Both handlers are now called
sch_handle_ingress() and sch_handle_egress(), code sharing among the two
doesn't seem practical as there are various minor differences in both
paths, so that making them conditional in a single handler would rather
slow things down.

Full compatibility to ingress qdisc is provided as well. Since both
piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
per netdevice, and thus ingress qdisc specific behaviour can be retained
for user space. This means, either a user does 'tc qdisc add dev foo ingress'
and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
alternative, where both, ingress and egress classifier can be configured
as in the below example. ingress qdisc supports attaching classifier to any
minor number whereas clsact has two fixed minors for muxing between the
lists, therefore to not break user space setups, they are better done as
two separate qdiscs.

I decided to extend the sch_ingress module with clsact functionality so
that commonly used code can be reused, the module is being aliased with
sch_clsact so that it can be auto-loaded properly. Alternative would have been
to add a flag when initializing ingress to alter its behaviour plus aliasing
to a different name (as it's more than just ingress). However, the first would
end up, based on the flag, choosing the new/old behaviour by calling different
function implementations to handle each anyway, the latter would require to
register ingress qdisc once again under different alias. So, this really begs
to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops
by its own that share callbacks used by both.

Example, adding qdisc:

   # tc qdisc add dev foo clsact
   # tc qdisc show dev foo
   qdisc mq 0: root
   qdisc pfifo_fast 0: parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
   qdisc pfifo_fast 0: parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
   qdisc pfifo_fast 0: parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
   qdisc pfifo_fast 0: parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
   qdisc clsact ffff: parent ffff:fff1

Adding filters (deleting, etc works analogous by specifying ingress/egress):

   # tc filter add dev foo ingress bpf da obj bar.o sec ingress
   # tc filter add dev foo egress  bpf da obj bar.o sec egress
   # tc filter show dev foo ingress
   filter protocol all pref 49152 bpf
   filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
   # tc filter show dev foo egress
   filter protocol all pref 49152 bpf
   filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action

A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
show an empty list for clsact. Either using the parent names (ingress/egress)
or specifying the full major/minor will then show the related filter lists.

Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.

  [1] http://patchwork.ozlabs.org/patch/512949/

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 22:13:15 -05:00
Sasha Levin 320f1a4a17 net: sctp: prevent writes to cookie_hmac_alg from accessing invalid memory
proc_dostring() needs an initialized destination string, while the one
provided in proc_sctp_do_hmac_alg() contains stack garbage.

Thus, writing to cookie_hmac_alg would strlen() that garbage and end up
accessing invalid memory.

Fixes: 3c68198e7 ("sctp: Make hmac algorithm selection for cookie generation dynamic")
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 18:01:01 -05:00
Daniel Borkmann f8ffad69c9 bpf: add skb_postpush_rcsum and fix dev_forward_skb occasions
Add a small helper skb_postpush_rcsum() and fix up redirect locations
that need CHECKSUM_COMPLETE fixups on ingress. dev_forward_skb() expects
a proper csum that covers also Ethernet header, f.e. since 2c26d34bbc
("net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding"), we
also do skb_postpull_rcsum() after pulling Ethernet header off via
eth_type_trans().

When using eBPF in a netns setup f.e. with vxlan in collect metadata mode,
I can trigger the following csum issue with an IPv6 setup:

  [  505.144065] dummy1: hw csum failure
  [...]
  [  505.144108] Call Trace:
  [  505.144112]  <IRQ>  [<ffffffff81372f08>] dump_stack+0x44/0x5c
  [  505.144134]  [<ffffffff81607cea>] netdev_rx_csum_fault+0x3a/0x40
  [  505.144142]  [<ffffffff815fee3f>] __skb_checksum_complete+0xcf/0xe0
  [  505.144149]  [<ffffffff816f0902>] nf_ip6_checksum+0xb2/0x120
  [  505.144161]  [<ffffffffa08c0e0e>] icmpv6_error+0x17e/0x328 [nf_conntrack_ipv6]
  [  505.144170]  [<ffffffffa0898eca>] ? ip6t_do_table+0x2fa/0x645 [ip6_tables]
  [  505.144177]  [<ffffffffa08c0725>] ? ipv6_get_l4proto+0x65/0xd0 [nf_conntrack_ipv6]
  [  505.144189]  [<ffffffffa06c9a12>] nf_conntrack_in+0xc2/0x5a0 [nf_conntrack]
  [  505.144196]  [<ffffffffa08c039c>] ipv6_conntrack_in+0x1c/0x20 [nf_conntrack_ipv6]
  [  505.144204]  [<ffffffff8164385d>] nf_iterate+0x5d/0x70
  [  505.144210]  [<ffffffff816438d6>] nf_hook_slow+0x66/0xc0
  [  505.144218]  [<ffffffff816bd302>] ipv6_rcv+0x3f2/0x4f0
  [  505.144225]  [<ffffffff816bca40>] ? ip6_make_skb+0x1b0/0x1b0
  [  505.144232]  [<ffffffff8160b77b>] __netif_receive_skb_core+0x36b/0x9a0
  [  505.144239]  [<ffffffff8160bdc8>] ? __netif_receive_skb+0x18/0x60
  [  505.144245]  [<ffffffff8160bdc8>] __netif_receive_skb+0x18/0x60
  [  505.144252]  [<ffffffff8160ccff>] process_backlog+0x9f/0x140
  [  505.144259]  [<ffffffff8160c4a5>] net_rx_action+0x145/0x320
  [...]

What happens is that on ingress, we push Ethernet header back in, either
from cls_bpf or right before skb_do_redirect(), but without updating csum.
The "hw csum failure" can be fixed by using the new skb_postpush_rcsum()
helper for the dev_forward_skb() case to correct the csum diff again.

Thanks to Hannes Frederic Sowa for the csum_partial() idea!

Fixes: 3896d655f4 ("bpf: introduce bpf_clone_redirect() helper")
Fixes: 27b29f6305 ("bpf: add bpf_redirect() helper")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 17:54:28 -05:00
Daniel Borkmann fdc5432a7b net, sched: add skb_at_tc_ingress helper
Add a skb_at_tc_ingress() as this will be needed elsewhere as well and
can hide the ugly ifdef.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 17:54:28 -05:00
Nikolay Borisov b840d15d39 ipv4: Namespecify the tcp_keepalive_intvl sysctl knob
This is the final part required to namespaceify the tcp
keep alive mechanism.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 17:32:09 -05:00
Nikolay Borisov 9bd6861bd4 ipv4: Namespecify tcp_keepalive_probes sysctl knob
This is required to have full tcp keepalive mechanism namespace
support.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 17:32:09 -05:00
Nikolay Borisov 13b287e8d1 ipv4: Namespaceify tcp_keepalive_time sysctl knob
Different net namespaces might have different requirements as to
the keepalive time of tcp sockets. This might be required in cases
where different firewall rules are in place which require tcp
timeout sockets to be increased/decreased independently of the host.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 17:32:09 -05:00
Hannes Frederic Sowa 787d7ac308 udp: restrict offloads to one namespace
udp tunnel offloads tend to aggregate datagrams based on inner
headers. gro engine gets notified by tunnel implementations about
possible offloads. The match is solely based on the port number.

Imagine a tunnel bound to port 53, the offloading will look into all
DNS packets and tries to aggregate them based on the inner data found
within. This could lead to data corruption and malformed DNS packets.

While this patch minimizes the problem and helps an administrator to find
the issue by querying ip tunnel/fou, a better way would be to match on
the specific destination ip address so if a user space socket is bound
to the same address it will conflict.

Cc: Tom Herbert <tom@herbertland.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 17:28:24 -05:00
Elad Raz f1fecb1d10 bridge: Reflect MDB entries to hardware
Offload MDB changes per port to hardware

Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 16:50:21 -05:00
Elad Raz 4d41e12593 switchdev: Adding MDB entry offload
Define HW multicast entry: MAC and VID.
Using a MAC address simplifies support for both IPV4 and IPv6.

Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 16:50:20 -05:00
Sven Eckelmann ed21d170e8 batman-adv: Add kerneldoc for batadv_neigh_node::refcount
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Sven Eckelmann 8a3719a184 batman-adv: Remove kerneldoc for missing struct members
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Sven Eckelmann 006a199d5d batman-adv: Fix kerneldoc member names in for main structs
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Sven Eckelmann 426fc6c811 batman-adv: Fix kernel-doc parsing of main structs
kernel-doc is not able to skip an #ifdef between the kernel documentation
block and the start of the struct. Moving the #ifdef before the kernel doc
block avoids this problem

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Markus Elfring e087f34f28 batman-adv: Split a condition check
Let us split a check for a condition at the beginning of the
batadv_is_ap_isolated() function so that a direct return can be performed
in this function if the variable "vlan" contained a null pointer.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Markus Elfring f75a33aeed batman-adv: Delete an unnecessary check before the function call "batadv_softif_vlan_free_ref"
The batadv_softif_vlan_free_ref() function tests whether its argument is NULL
and then returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Markus Elfring 8bbb7cb232 batman-adv: Less checks in batadv_tvlv_unicast_send()
* Let us return directly if a call of the batadv_orig_hash_find() function
  returned a null pointer.

* Omit the initialisation for the variable "skb" at the beginning.

* Replace an assignment by a call of the kfree_skb() function
  and delete the affected variable "ret" then.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Markus Elfring c799443ee1 batman-adv: Delete unnecessary checks before the function call "kfree_skb"
The kfree_skb() function tests whether its argument is NULL and then
returns immediately. Thus the test around the calls is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Sven Eckelmann d737ccbed3 batman-adv: Add function to convert string to batadv throughput
The code to convert the throughput information from a string to the
batman-adv internal (100Kibit/s) representation is duplicated in
batadv_parse_gw_bandwidth. Move this functionality to its own function
batadv_parse_throughput to reduce the code complexity.

Signed-off-by: Sven Eckelmann <sven@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Simon Wunderlich 9e728e8438 batman-adv: only call post function if something changed
Currently, the post function is also called on errors or if there were
no changes, which is redundant for the functions currently using these
facilities.

Signed-off-by: Simon Wunderlich <simon@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Simon Wunderlich e1544f3c87 batman-adv: increase BLA wait periods to 6
If networks take a long time to come up, e.g. due to lossy links, then
the bridge loop avoidance wait time to suppress broadcasts may not wait
long enough and detect a backbone before the mesh is brought up.
Increasing the wait period further to 60 seconds makes this scenario
less likely.

Signed-off-by: Simon Wunderlich <simon@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Simon Wunderlich d68081a240 batman-adv: purge bridge loop avoidance when its disabled
When bridge loop avoidance is disabled through sysfs, the internal
datastructures are not disabled, but only BLA operations are disabled.
To be sure that they are removed, purge the data immediately. That is
especially useful if a firmwares network state is changed, and the BLA
wait periods should restart on the new network.

Signed-off-by: Simon Wunderlich <simon@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Marek Lindner 143d157c9e batman-adv: remove leftovers of unused BATADV_PRIMARIES_FIRST_HOP flag
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Sven Eckelmann 008a374487 batman-adv: Fix lockdep annotation of batadv_tlv_container_remove
The function handles tlv containers and not tlv handlers. Thus the
lockdep_assert_held has to check for the container_list lock.

Fixes: 2c72d655b0 ("batman-adv: Annotate deleting functions with external lock via lockdep")
Signed-off-by: Sven Eckelmann <sven@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Simon Wunderlich 4a4d045eb2 batman-adv: Start new development cycle
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-09 20:56:00 +08:00
Lance Richardson ad64b8be71 ipv4: eliminate lock count warnings in ping.c
Add lock release/acquire annotations to ping_seq_start() and
ping_seq_stop() to satisfy sparse.

Signed-off-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-08 21:30:43 -05:00
Lance Richardson 30d3d83a7d ipv4: fix endianness warnings in ip_tunnel_core.c
Eliminate endianness mismatch warnings (reported by sparse) in this file by
using appropriate nla_put_*()/nla_get_*() calls.

Signed-off-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-08 21:30:43 -05:00
Al Viro 6108209c4a Merge branch 'for-linus' into work.misc 2016-01-08 21:20:11 -05:00
David S. Miller 9b59377b75 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next, they are:

1) Release nf_tables objects on netns destructions via
   nft_release_afinfo().

2) Destroy basechain and rules on netdevice removal in the new netdev
   family.

3) Get rid of defensive check against removal of inactive objects in
   nf_tables.

4) Pass down netns pointer to our existing nfnetlink callbacks, as well
   as commit() and abort() nfnetlink callbacks.

5) Allow to invert limit expression in nf_tables, so we can throttle
   overlimit traffic.

6) Add packet duplication for the netdev family.

7) Add forward expression for the netdev family.

8) Define pr_fmt() in conntrack helpers.

9) Don't leave nfqueue configuration on inconsistent state in case of
   errors, from Ken-ichirou MATSUZAWA, follow up patches are also from
   him.

10) Skip queue option handling after unbind.

11) Return error on unknown both in nfqueue and nflog command.

12) Autoload ctnetlink when NFQA_CFG_F_CONNTRACK is set.

13) Add new NFTA_SET_USERDATA attribute to store user data in sets,
    from Carlos Falgueras.

14) Add support for 64 bit byteordering changes nf_tables, from Florian
    Westphal.

15) Add conntrack byte/packet counter matching support to nf_tables,
    also from Florian.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-08 20:53:16 -05:00
David S. Miller 250fbf129e Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2016-01-08

Here's one more bluetooth-next pull request for the 4.5 kernel:

 - Support for CRC check and promiscuous mode for CC2520
 - Fixes to btmrvl driver
 - New ACPI IDs for hci_bcm driver
 - Limited Discovery support for the Bluetooth mgmt interface
 - Minor other cleanups here and there

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-08 13:17:31 -05:00
Florian Westphal 48f66c905a netfilter: nft_ct: add byte/packet counter support
If the accounting extension isn't present, we'll return a counter
value of 0.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-01-08 14:44:09 +01:00
Florian Westphal ce1e7989d9 netfilter: nft_byteorder: provide 64bit le/be conversion
Needed to convert the (64bit) conntrack counters to BE ordering.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-01-08 13:32:11 +01:00
Carlos Falgueras García e6d8ecac9e netfilter: nf_tables: Add new attributes into nft_set to store user data.
User data is stored at after 'nft_set_ops' private data into 'data[]'
flexible array. The field 'udata' points to user data and 'udlen' stores
its length.

Add new flag NFTA_SET_USERDATA.

Signed-off-by: Carlos Falgueras García <carlosfg@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-01-08 13:25:08 +01:00
Ken-ichirou MATSUZAWA eb075954e9 netfilter: nfnetlink_log: just returns error for unknown command
This patch stops processing options for unknown command.

Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-01-08 13:25:07 +01:00
Ken-ichirou MATSUZAWA 71b2e5f5ca netfilter: nfnetlink_queue: autoload nf_conntrack_netlink module NFQA_CFG_F_CONNTRACK config flag
This patch enables to load nf_conntrack_netlink module if
NFQA_CFG_F_CONNTRACK config flag is specified.

Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-01-08 13:25:06 +01:00
Ken-ichirou MATSUZAWA 21c3c971d1 netfilter: nfnetlink_queue: just returns error for unknown command
This patch stops processing options for unknown command.

Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-01-08 13:25:05 +01:00
Ken-ichirou MATSUZAWA 17bc6b4884 netfilter: nfnetlink_queue: don't handle options after unbind
This patch stops processing after destroying a queue instance.

Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-01-08 13:25:04 +01:00
Ken-ichirou MATSUZAWA 60d2c7f9ab netfilter: nfnetlink_queue: validate dependencies to avoid breaking atomicity
Check that dependencies are fulfilled before updating the queue
instance, otherwise we can leave things in intermediate state on errors
in nfqnl_recv_config().

Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-01-08 13:25:03 +01:00
Trond Myklebust daaadd2283 Merge branch 'bugfixes'
* bugfixes:
  SUNRPC: Fixup socket wait for memory
  SUNRPC: Fix a missing break in rpc_anyaddr()
  pNFS/flexfiles: Fix an Oopsable typo in ff_mirror_match_fh()
  NFS: Fix attribute cache revalidation
  NFS: Ensure we revalidate attributes before using execute_ok()
  NFS: Flush reclaim writes using FLUSH_COND_STABLE
  NFS: Background flush should not be low priority
  NFSv4.1/pnfs: Fixup an lo->plh_block_lgets imbalance in layoutreturn
  NFSv4: Don't perform cached access checks before we've OPENed the file
  NFS: Allow the combination pNFS and labeled NFS
  NFS42: handle layoutstats stateid error
  nfs: Fix race in __update_open_stateid()
  nfs: fix missing assignment in nfs4_sequence_done tracepoint
2016-01-07 18:45:36 -05:00
Andrew Lunn 0071f56e46 dsa: Register netdev before phy
When the phy is connected, an info message is printed. If the netdev
it is attached to has not been registered yet, the name
'uninitialised' in the output. By registering the netdev first, then
connecting they phy, we can avoid this.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-07 14:31:26 -05:00
Andrew Lunn 7f854420fb phy: Add API for {un}registering an mdio device to a bus.
Rather than have drivers directly manipulate the mii_bus structure,
provide and API for registering and unregistering devices on an MDIO
bus, and performing lookups.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-07 14:31:26 -05:00
Andrew Lunn e5a03bfd87 phy: Add an mdio_device structure
Not all devices attached to an MDIO bus are phys. So add an
mdio_device structure to represent the generic parts of an mdio
device, and place this structure into the phy_device.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-07 14:31:26 -05:00
Andrew Lunn 2220943a21 phy: Centralise print about attached phy
Many Ethernet drivers contain the same netdev_info() print statement
about the attached phy. Move it into the phy device code. Additionally
add a varargs function which can be used to append additional
information.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-07 14:31:25 -05:00
J. Bruce Fields 3daa020f9b Revert "svcrdma: Do not send XDR roundup bytes for a write chunk"
This reverts commit 6f18dc8939.

Just as one example, it appears this code could do the wrong thing in
the case of a two-byte NFS READ that crosses a page boundary.

Chuck says: "In that case, nfsd would pass down an xdr_buf that has one
byte in a page, one byte in another page, and a two-byte XDR pad. The
logic introduced by this optimization would be fooled, and neither the
second byte nor the XDR pad would be written to the client."

Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2016-01-07 10:10:48 -05:00
Sven Eckelmann 13bbdd370f batman-adv: Fix invalid read while copying bat_iv.bcast_own
batadv_iv_ogm_orig_del_if removes a part of the bcast_own which previously
belonged to the now removed interface. This is done by copying all data
which comes before the removed interface and then appending all the data
which comes after the removed interface.

The address calculation for the position of the data which comes after the
removed interface assumed that the bat_iv.bcast_own is a pointer to a
single byte datatype. But it is a pointer to unsigned long and thus the
calculated position was wrong off factor sizeof(unsigned long).

Fixes: 83a8342678a0 ("more basic routing code added (forwarding packets /
bitarray added)")

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <a@unstable.cc>
2016-01-07 14:24:05 +08:00
David S. Miller 9e0efaf6b4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-01-06 22:54:18 -05:00
Yuchung Cheng 8b8a321ff7 tcp: fix zero cwnd in tcp_cwnd_reduction
Patch 3759824da8 ("tcp: PRR uses CRB mode by default and SS mode
conditionally") introduced a bug that cwnd may become 0 when both
inflight and sndcnt are 0 (cwnd = inflight + sndcnt). This may lead
to a div-by-zero if the connection starts another cwnd reduction
phase by setting tp->prior_cwnd to the current cwnd (0) in
tcp_init_cwnd_reduction().

To prevent this we skip PRR operation when nothing is acked or
sacked. Then cwnd must be positive in all cases as long as ssthresh
is positive:

1) The proportional reduction mode
   inflight > ssthresh > 0

2) The reduction bound mode
  a) inflight == ssthresh > 0

  b) inflight < ssthresh
     sndcnt > 0 since newly_acked_sacked > 0 and inflight < ssthresh

Therefore in all cases inflight and sndcnt can not both be 0.
We check invalid tp->prior_cwnd to avoid potential div0 bugs.

In reality this bug is triggered only with a sequence of less common
events.  For example, the connection is terminating an ECN-triggered
cwnd reduction with an inflight 0, then it receives reordered/old
ACKs or DSACKs from prior transmission (which acks nothing). Or the
connection is in fast recovery stage that marks everything lost,
but fails to retransmit due to local issues, then receives data
packets from other end which acks nothing.

Fixes: 3759824da8 ("tcp: PRR uses CRB mode by default and SS mode conditionally")
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06 16:39:56 -05:00
David S. Miller c7f5d10549 net: Add eth_platform_get_mac_address() helper.
A repeating pattern in drivers has become to use OF node information
and, if not found, platform specific host information to extract the
ethernet address for a given device.

Currently this is done with a call to of_get_mac_address() and then
some ifdef'd stuff for SPARC.

Consolidate this into a portable routine, and provide the
arch_get_platform_mac_address() weak function hook for all
architectures to implement if they want.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06 16:31:56 -05:00
Francesco Ruggeri 07a5d38453 net: possible use after free in dst_release
dst_release should not access dst->flags after decrementing
__refcnt to 0. The dst_entry may be in dst_busy_list and
dst_gc_task may dst_destroy it before dst_release gets a chance
to access dst->flags.

Fixes: d69bbf88c8 ("net: fix a race in dst_release()")
Fixes: 27b75c95f1 ("net: avoid RCU for NOCACHE dst")
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06 15:00:27 -05:00
Elad Raz 404cdbf089 bridge: add vlan filtering change for new bridged device
Notifying hardware about newly bridged port vlan-aware changes.

Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06 14:42:41 -05:00
Elad Raz 6b72a77020 bridge: add vlan filtering change notification
Notifying hardware about bridge vlan-aware changes.

Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06 14:42:40 -05:00
Elad Raz 08474cc1e6 bridge: Propagate vlan add failure to user
Disallow adding interfaces to a bridge when vlan filtering operation
failed. Send the failure code to the user.

Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06 14:42:40 -05:00
Trond Myklebust 13331a551a SUNRPC: Fixup socket wait for memory
We're seeing hangs in the NFS client code, with loops of the form:

 RPC: 30317 xmit incomplete (267368 left of 524448)
 RPC: 30317 call_status (status -11)
 RPC: 30317 call_transmit (status 0)
 RPC: 30317 xprt_prepare_transmit
 RPC: 30317 xprt_transmit(524448)
 RPC:       xs_tcp_send_request(267368) = -11
 RPC: 30317 xmit incomplete (267368 left of 524448)
 RPC: 30317 call_status (status -11)
 RPC: 30317 call_transmit (status 0)
 RPC: 30317 xprt_prepare_transmit
 RPC: 30317 xprt_transmit(524448)

Turns out commit ceb5d58b21 ("net: fix sock_wake_async() rcu protection")
moved SOCKWQ_ASYNC_NOSPACE out of sock->flags and into sk->sk_wq->flags,
however it never tried to fix up the code in net/sunrpc.

The new idiom is to use the flags in the RCU protected struct socket_wq.
While we're at it, clear out the now redundant places where we set/clear
SOCKWQ_ASYNC_NOSPACE and SOCK_NOSPACE. In principle, sk_stream_wait_memory()
is supposed to set these for us, so we only need to clear them in the
particular case of our ->write_space() callback.

Fixes: ceb5d58b21 ("net: fix sock_wake_async() rcu protection")
Cc: Eric Dumazet <edumazet@google.com>
Cc: stable@vger.kernel.org # 4.4
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2016-01-06 12:22:25 -05:00
Johannes Berg 787b306cf3 Bluetooth: avoid rebuilding hci_sock all the time
Instead, allow using string formatting with send_monitor_note()
and access init_utsname().

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-01-06 16:36:44 +01:00
John Fastabend 73c20a8b72 net: sched: fix missing free per cpu on qstats
When a qdisc is using per cpu stats (currently just the ingress
qdisc) only the bstats are being freed. This also free's the qstats.

Fixes: b0ab6f9275 ("net: sched: enable per cpu qstats")
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06 01:40:21 -05:00
Craig Gallek 00ce3a15d8 soreuseport: change consume_skb to kfree_skb in error case
Fixes: 538950a1b7 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF")
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06 01:30:27 -05:00
Craig Gallek 1134158ba3 soreuseport: pass skb to secondary UDP socket lookup
This socket-lookup path did not pass along the skb in question
in my original BPF-based socket selection patch.  The skb in the
udpN_lib_lookup2 path can be used for BPF-based socket selection just
like it is in the 'traditional' udpN_lib_lookup path.

udpN_lib_lookup2 kicks in when there are greater than 10 sockets in
the same hlist slot.  Coincidentally, I chose 10 sockets per
reuseport group in my functional test, so the lookup2 path was not
excersised. This adds an additional set of tests with 20 sockets.

Fixes: 538950a1b7 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF")
Fixes: 3ca8e40299 ("soreuseport: BPF selection functional test")
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06 01:28:04 -05:00
Florian Westphal a72a5e2d34 inet: kill unused skb_free op
The only user was removed in commit
029f7f3b87 ("netfilter: ipv6: nf_defrag: avoid/free clone operations").

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05 22:25:57 -05:00
Hannes Frederic Sowa ff62198553 bridge: Only call /sbin/bridge-stp for the initial network namespace
[I stole this patch from Eric Biederman. He wrote:]

> There is no defined mechanism to pass network namespace information
> into /sbin/bridge-stp therefore don't even try to invoke it except
> for bridge devices in the initial network namespace.
>
> It is possible for unprivileged users to cause /sbin/bridge-stp to be
> invoked for any network device name which if /sbin/bridge-stp does not
> guard against unreasonable arguments or being invoked twice on the
> same network device could cause problems.

[Hannes: changed patch using netns_eq]

Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05 16:46:17 -05:00
Xin Long c79c066691 sctp: remove the local_bh_disable/enable in sctp_endpoint_lookup_assoc
sctp_endpoint_lookup_assoc is called in the protection of sock lock
there is no need to call local_bh_disable in this function. so remove
them.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05 12:24:02 -05:00
Xin Long b5eff71283 sctp: drop the old assoc hashtable of sctp
transport hashtable will replace the association hashtable,
so association hashtable is not used in sctp any more, so
drop the codes about that.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05 12:24:01 -05:00
Xin Long 39f66a7dce sctp: apply rhashtable api to sctp procfs
Traversal the transport rhashtable, get the association only once through
the condition assoc->peer.primary_path != transport.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05 12:24:01 -05:00
Xin Long 4f00878126 sctp: apply rhashtable api to send/recv path
apply lookup apis to two functions, for __sctp_endpoint_lookup_assoc
and __sctp_lookup_association, it's invoked in the protection of sock
lock, it will be safe, but sctp_lookup_association need to call
rcu_read_lock() and to detect the t->dead to protect it.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05 12:24:01 -05:00
Xin Long d6c0256a60 sctp: add the rhashtable apis for sctp global transport hashtable
tranport hashtbale will replace the association hashtable to do the
lookup for transport, and then get association by t->assoc, rhashtable
apis will be used because of it's resizable, scalable and using rcu.

lport + rport + paddr will be the base hashkey to locate the chain,
with net to protect one netns from another, then plus the laddr to
compare to get the target.

this patch will provider the lookup functions:
- sctp_epaddr_lookup_transport
- sctp_addrs_lookup_transport

hash/unhash functions:
- sctp_hash_transport
- sctp_unhash_transport

init/destroy functions:
- sctp_transport_hashtable_init
- sctp_transport_hashtable_destroy

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-05 12:24:00 -05:00
Johan Hedberg 78b781ca0d Bluetooth: Add support for Start Limited Discovery command
This patch implements the mgmt Start Limited Discovery command. Most
of existing Start Discovery code is reused since the only difference
is the presence of a 'limited' flag as part of the discovery state.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-01-05 17:02:50 +01:00
Johan Hedberg 0d3b7f64c8 Bluetooth: Change eir_has_data_type() to more generic eir_get_data()
To make the EIR parsing helper more general purpose, make it return
the found data and its length rather than just saying whether the data
was present or not.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2016-01-05 17:02:49 +01:00