Commit Graph

1659 Commits

Author SHA1 Message Date
David S. Miller fba961ab29 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Lots of overlapping changes.  Also on the net-next side
the XDP state management is handled more in the generic
layers so undo the 'net' nfp fix which isn't applicable
in net-next.

Include a necessary change by Jakub Kicinski, with log message:

====================
cls_bpf no longer takes care of offload tracking.  Make sure
netdevsim performs necessary checks.  This fixes a warning
caused by TC trying to remove a filter it has not added.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-22 11:16:31 -05:00
Jon Maloy bb25c3855a tipc: remove joining group member from congested list
When we receive a JOIN message from a peer member, the message may
contain an advertised window value ADV_IDLE that permits removing the
member in question from the tipc_group::congested list. However, since
the removal has been made conditional on that the advertised window is
*not* ADV_IDLE, we miss this case. This has the effect that a sender
sometimes may enter a state of permanent, false, broadcast congestion.

We fix this by unconditinally removing the member from the congested
list before calling tipc_member_update(), which might potentially sort
it into the list again.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-20 14:56:48 -05:00
Jon Maloy 3db0960117 tipc: fix list sorting bug in function tipc_group_update_member()
When, during a join operation, or during message transmission, a group
member needs to be added to the group's 'congested' list, we sort it
into the list in ascending order, according to its current advertised
window size. However, we miss the case when the member is already on
that list. This will have the result that the member, after the window
size has been decremented, might be at the wrong position in that list.
This again may have the effect that we during broadcast and multicast
transmissions miss the fact that a destination is not yet ready for
reception, and we end up sending anyway. From this point on, the
behavior during the remaining session is unpredictable, e.g., with
underflowing window sizes.

We now correct this bug by unconditionally removing the member from
the list before (re-)sorting it in.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-19 14:10:03 -05:00
Jon Maloy 3f42f5fe31 tipc: remove leaving group member from all lists
A group member going into state LEAVING should never go back to any
other state before it is finally deleted. However, this might happen
if the socket needs to send out a RECLAIM message during this interval.
Since we forget to remove the leaving member from the group's 'active'
or 'pending' list, the member might be selected for reclaiming, change
state to RECLAIMING, and get stuck in this state instead of being
deleted. This might lead to suppression of the expected 'member down'
event to the receiver.

We fix this by removing the member from all lists, except the RB tree,
at the moment it goes into state LEAVING.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-18 13:16:40 -05:00
Jon Maloy 234833991e tipc: fix lost member events bug
Group messages are not supposed to be returned to sender when the
destination socket disappears. This is done correctly for regular
traffic messages, by setting the 'dest_droppable' bit in the header.
But we forget to do that in group protocol messages. This has the effect
that such messages may sometimes bounce back to the sender, be perceived
as a legitimate peer message, and wreak general havoc for the rest of
the session. In particular, we have seen that a member in state LEAVING
may go back to state RECLAIMED or REMITTED, hence causing suppression
of an otherwise expected 'member down' event to the user.

We fix this by setting the 'dest_droppable' bit even in group protocol
messages.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-18 13:16:40 -05:00
David S. Miller c30abd5e40 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three sets of overlapping changes, two in the packet scheduler
and one in the meson-gxl PHY driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-16 22:11:55 -05:00
Jon Maloy c545a945d0 tipc: eliminate potential memory leak
In the function tipc_sk_mcast_rcv() we call refcount_dec(&skb->users)
on received sk_buffers. Since the reference counter might hit zero at
this point, we have a potential memory leak.

We fix this by replacing refcount_dec() with kfree_skb().

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-13 13:44:36 -05:00
Tom Herbert 97a6ec4ac0 rhashtable: Change rhashtable_walk_start to return void
Most callers of rhashtable_walk_start don't care about a resize event
which is indicated by a return value of -EAGAIN. So calls to
rhashtable_walk_start are wrapped wih code to ignore -EAGAIN. Something
like this is common:

       ret = rhashtable_walk_start(rhiter);
       if (ret && ret != -EAGAIN)
               goto out;

Since zero and -EAGAIN are the only possible return values from the
function this check is pointless. The condition never evaluates to true.

This patch changes rhashtable_walk_start to return void. This simplifies
code for the callers that ignore -EAGAIN. For the few cases where the
caller cares about the resize event, particularly where the table can be
walked in mulitple parts for netlink or seq file dump, the function
rhashtable_walk_start_check has been added that returns -EAGAIN on a
resize event.

Signed-off-by: Tom Herbert <tom@quantonium.net>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-11 09:58:38 -05:00
David S. Miller 51e18a453f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflict was two parallel additions of include files to sch_generic.c,
no biggie.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-09 22:09:55 -05:00
Jon Maloy a7d5f107b4 tipc: fix memory leak in tipc_accept_from_sock()
When the function tipc_accept_from_sock() fails to create an instance of
struct tipc_subscriber it omits to free the already created instance of
struct tipc_conn instance before it returns.

We fix that with this commit.

Reported-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05 14:52:41 -05:00
Cong Wang 672ecbe1c9 tipc: fix a null pointer deref on error path
In tipc_topsrv_kern_subscr() when s->tipc_conn_new() fails
we call tipc_close_conn() to clean up, but in this case
calling conn_put() is just enough.

This fixes the folllowing crash:

 kasan: GPF could be caused by NULL-ptr deref or user memory access
 general protection fault: 0000 [#1] SMP KASAN
 Dumping ftrace buffer:
    (ftrace buffer empty)
 Modules linked in:
 CPU: 0 PID: 3085 Comm: syzkaller064164 Not tainted 4.15.0-rc1+ #137
 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
 task: 00000000c24413a5 task.stack: 000000005e8160b5
 RIP: 0010:__lock_acquire+0xd55/0x47f0 kernel/locking/lockdep.c:3378
 RSP: 0018:ffff8801cb5474a8 EFLAGS: 00010002
 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
 RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffff85ecb400
 RBP: ffff8801cb547830 R08: 0000000000000001 R09: 0000000000000000
 R10: 0000000000000000 R11: ffffffff87489d60 R12: ffff8801cd2980c0
 R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000020
 FS:  00000000014ee880(0000) GS:ffff8801db400000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007ffee2426e40 CR3: 00000001cb85a000 CR4: 00000000001406f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:4004
  __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline]
  _raw_spin_lock_bh+0x31/0x40 kernel/locking/spinlock.c:175
  spin_lock_bh include/linux/spinlock.h:320 [inline]
  tipc_subscrb_subscrp_delete+0x8f/0x470 net/tipc/subscr.c:201
  tipc_subscrb_delete net/tipc/subscr.c:238 [inline]
  tipc_subscrb_release_cb+0x17/0x30 net/tipc/subscr.c:316
  tipc_close_conn+0x171/0x270 net/tipc/server.c:204
  tipc_topsrv_kern_subscr+0x724/0x810 net/tipc/server.c:514
  tipc_group_create+0x702/0x9c0 net/tipc/group.c:184
  tipc_sk_join net/tipc/socket.c:2747 [inline]
  tipc_setsockopt+0x249/0xc10 net/tipc/socket.c:2861
  SYSC_setsockopt net/socket.c:1851 [inline]
  SyS_setsockopt+0x189/0x360 net/socket.c:1830
  entry_SYSCALL_64_fastpath+0x1f/0x96

Fixes: 14c04493cb ("tipc: add ability to order and receive topology events in driver")
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05 14:51:59 -05:00
David S. Miller 7cda4cee13 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Small overlapping change conflict ('net' changed a line,
'net-next' added a line right afterwards) in flexcan.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05 10:44:19 -05:00
Al Viro bc4802736d tipc: switch to sock_recvmsg()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-12-02 20:38:10 -05:00
Jon Maloy 4c94cc2d3d tipc: fall back to smaller MTU if allocation of local send skb fails
When sending node local messages the code is using an 'mtu' of 66060
bytes to avoid unnecessary fragmentation. During situations of low
memory tipc_msg_build() may sometimes fail to allocate such large
buffers, resulting in unnecessary send failures. This can easily be
remedied by falling back to a smaller MTU, and then reassemble the
buffer chain as if the message were arriving from a remote node.

At the same time, we change the initial MTU setting of the broadcast
link to a lower value, so that large messages always are fragmented
into smaller buffers even when we run in single node mode. Apart from
obtaining the same advantage as for the 'fallback' solution above, this
turns out to give a significant performance improvement. This can
probably be explained with the __pskb_copy() operation performed on the
buffer for each recipient during reception. We found the optimal value
for this, considering the most relevant skb pool, to be 3744 bytes.

Acked-by: Ying Xue <ying.xue@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-01 15:21:25 -05:00
Tommi Rantala c7799c067c tipc: call tipc_rcv() only if bearer is up in tipc_udp_recv()
Remove the second tipc_rcv() call in tipc_udp_recv(). We have just
checked that the bearer is not up, and calling tipc_rcv() with a bearer
that is not up leads to a TIPC div-by-zero crash in
tipc_node_calculate_timer(). The crash is rare in practice, but can
happen like this:

  We're enabling a bearer, but it's not yet up and fully initialized.
  At the same time we receive a discovery packet, and in tipc_udp_recv()
  we end up calling tipc_rcv() with the not-yet-initialized bearer,
  causing later the div-by-zero crash in tipc_node_calculate_timer().

Jon Maloy explains the impact of removing the second tipc_rcv() call:
  "link setup in the worst case will be delayed until the next arriving
   discovery messages, 1 sec later, and this is an acceptable delay."

As the tipc_rcv() call is removed, just leave the function via the
rcu_out label, so that we will kfree_skb().

[   12.590450] Own node address <1.1.1>, network identity 1
[   12.668088] divide error: 0000 [#1] SMP
[   12.676952] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.14.2-dirty #1
[   12.679225] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
[   12.682095] task: ffff8c2a761edb80 task.stack: ffffa41cc0cac000
[   12.684087] RIP: 0010:tipc_node_calculate_timer.isra.12+0x45/0x60 [tipc]
[   12.686486] RSP: 0018:ffff8c2a7fc838a0 EFLAGS: 00010246
[   12.688451] RAX: 0000000000000000 RBX: ffff8c2a5b382600 RCX: 0000000000000000
[   12.691197] RDX: 0000000000000000 RSI: ffff8c2a5b382600 RDI: ffff8c2a5b382600
[   12.693945] RBP: ffff8c2a7fc838b0 R08: 0000000000000001 R09: 0000000000000001
[   12.696632] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8c2a5d8949d8
[   12.699491] R13: ffffffff95ede400 R14: 0000000000000000 R15: ffff8c2a5d894800
[   12.702338] FS:  0000000000000000(0000) GS:ffff8c2a7fc80000(0000) knlGS:0000000000000000
[   12.705099] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   12.706776] CR2: 0000000001bb9440 CR3: 00000000bd009001 CR4: 00000000003606e0
[   12.708847] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   12.711016] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   12.712627] Call Trace:
[   12.713390]  <IRQ>
[   12.714011]  tipc_node_check_dest+0x2e8/0x350 [tipc]
[   12.715286]  tipc_disc_rcv+0x14d/0x1d0 [tipc]
[   12.716370]  tipc_rcv+0x8b0/0xd40 [tipc]
[   12.717396]  ? minmax_running_min+0x2f/0x60
[   12.718248]  ? dst_alloc+0x4c/0xa0
[   12.718964]  ? tcp_ack+0xaf1/0x10b0
[   12.719658]  ? tipc_udp_is_known_peer+0xa0/0xa0 [tipc]
[   12.720634]  tipc_udp_recv+0x71/0x1d0 [tipc]
[   12.721459]  ? dst_alloc+0x4c/0xa0
[   12.722130]  udp_queue_rcv_skb+0x264/0x490
[   12.722924]  __udp4_lib_rcv+0x21e/0x990
[   12.723670]  ? ip_route_input_rcu+0x2dd/0xbf0
[   12.724442]  ? tcp_v4_rcv+0x958/0xa40
[   12.725039]  udp_rcv+0x1a/0x20
[   12.725587]  ip_local_deliver_finish+0x97/0x1d0
[   12.726323]  ip_local_deliver+0xaf/0xc0
[   12.726959]  ? ip_route_input_noref+0x19/0x20
[   12.727689]  ip_rcv_finish+0xdd/0x3b0
[   12.728307]  ip_rcv+0x2ac/0x360
[   12.728839]  __netif_receive_skb_core+0x6fb/0xa90
[   12.729580]  ? udp4_gro_receive+0x1a7/0x2c0
[   12.730274]  __netif_receive_skb+0x1d/0x60
[   12.730953]  ? __netif_receive_skb+0x1d/0x60
[   12.731637]  netif_receive_skb_internal+0x37/0xd0
[   12.732371]  napi_gro_receive+0xc7/0xf0
[   12.732920]  receive_buf+0x3c3/0xd40
[   12.733441]  virtnet_poll+0xb1/0x250
[   12.733944]  net_rx_action+0x23e/0x370
[   12.734476]  __do_softirq+0xc5/0x2f8
[   12.734922]  irq_exit+0xfa/0x100
[   12.735315]  do_IRQ+0x4f/0xd0
[   12.735680]  common_interrupt+0xa2/0xa2
[   12.736126]  </IRQ>
[   12.736416] RIP: 0010:native_safe_halt+0x6/0x10
[   12.736925] RSP: 0018:ffffa41cc0cafe90 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff4d
[   12.737756] RAX: 0000000000000000 RBX: ffff8c2a761edb80 RCX: 0000000000000000
[   12.738504] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[   12.739258] RBP: ffffa41cc0cafe90 R08: 0000014b5b9795e5 R09: ffffa41cc12c7e88
[   12.740118] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
[   12.740964] R13: ffff8c2a761edb80 R14: 0000000000000000 R15: 0000000000000000
[   12.741831]  default_idle+0x2a/0x100
[   12.742323]  arch_cpu_idle+0xf/0x20
[   12.742796]  default_idle_call+0x28/0x40
[   12.743312]  do_idle+0x179/0x1f0
[   12.743761]  cpu_startup_entry+0x1d/0x20
[   12.744291]  start_secondary+0x112/0x120
[   12.744816]  secondary_startup_64+0xa5/0xa5
[   12.745367] Code: b9 f4 01 00 00 48 89 c2 48 c1 ea 02 48 3d d3 07 00
00 48 0f 47 d1 49 8b 0c 24 48 39 d1 76 07 49 89 14 24 48 89 d1 31 d2 48
89 df <48> f7 f1 89 c6 e8 81 6e ff ff 5b 41 5c 5d c3 66 90 66 2e 0f 1f
[   12.747527] RIP: tipc_node_calculate_timer.isra.12+0x45/0x60 [tipc] RSP: ffff8c2a7fc838a0
[   12.748555] ---[ end trace 1399ab83390650fd ]---
[   12.749296] Kernel panic - not syncing: Fatal exception in interrupt
[   12.750123] Kernel Offset: 0x13200000 from 0xffffffff82000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   12.751215] Rebooting in 60 seconds..

Fixes: c9b64d492b ("tipc: add replicast peer discovery")
Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-01 15:14:22 -05:00
Al Viro ade994f4f6 net: annotate ->poll() instances
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-11-27 16:20:04 -05:00
Jon Maloy 2e724dca77 tipc: eliminate access after delete in group_filter_msg()
KASAN revealed another access after delete in group.c. This time
it found that we read the header of a received message after the
buffer has been released.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-27 14:44:45 -05:00
Jon Maloy e0e853ac03 tipc: fix access of released memory
When the function tipc_group_filter_msg() finds that a member event
indicates that the member is leaving the group, it first deletes the
member instance, and then purges the message queue being handled
by the call. But the message queue is an aggregated field in the
just deleted item, leading the purge call to access freed memory.

We fix this by swapping the order of the two actions.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-21 20:22:03 +09:00
Jon Maloy d618d09a68 tipc: enforce valid ratio between skb truesize and contents
The socket level flow control is based on the assumption that incoming
buffers meet the condition (skb->truesize / roundup(skb->len) <= 4),
where the latter value is rounded off upwards to the nearest 1k number.
This does empirically hold true for the device drivers we know, but we
cannot trust that it will always be so, e.g., in a system with jumbo
frames and very small packets.

We now introduce a check for this condition at packet arrival, and if
we find it to be false, we copy the packet to a new, smaller buffer,
where the condition will be true. We expect this to affect only a small
fraction of all incoming packets, if at all.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-16 10:49:00 +09:00
Jon Maloy 8d6e79d3ce tipc: improve link resiliency when rps is activated
Currently, the TIPC RPS dissector is based only on the incoming packets'
source node address, hence steering all traffic from a node to the same
core. We have seen that this makes the links vulnerable to starvation
and unnecessary resets when we turn down the link tolerance to very low
values.

To reduce the risk of this happening, we exempt probe and probe replies
packets from the convergence to one core per source node. Instead, we do
the opposite, - we try to diverge those packets across as many cores as
possible, by randomizing the flow selector key.

To make such packets identifiable to the dissector, we add a new
'is_keepalive' bit to word 0 of the LINK_PROTOCOL header. This bit is
set both for PROBE and PROBE_REPLY messages, and only for those.

It should be noted that these packets are not part of any flow anyway,
and only constitute a minuscule fraction of all packets sent across a
link. Hence, there is no risk that this will affect overall performance.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-11 15:36:05 +09:00
David S. Miller 2a171788ba Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Files removed in 'net-next' had their license header updated
in 'net'.  We take the remove from 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-04 09:26:51 +09:00
Jon Maloy fa36882682 tipc: eliminate unnecessary probing
The neighbor monitor employs a threshold, default set to 32 peer nodes,
where it activates the "Overlapping Neighbor Monitoring" algorithm.
Below that threshold, monitoring is full-mesh, and no "domain records"
are passed between the nodes.

Because of this, a node never received a peer's ack that it has received
the most recent update of the own domain. Hence, the field 'acked_gen'
in struct tipc_monitor_state remains permamently at zero, whereas the
own domain generation is incremented for each added or removed peer.

This has the effect that the function tipc_mon_get_state() always sets
the field 'probing' in struct tipc_monitor_state true, again leading the
tipc_link_timeout() of the link in question to always send out a probe,
even when link->silent_intv_count is zero.

This is functionally harmless, but leads to some unncessary probing,
which can easily be eliminated by setting the 'probing' field of the
said struct correctly in such cases.

At the same time, we explictly invalidate the sent domain records when
the algorithm is not activated. This will eliminate any risk that an
invalid domain record might be inadverently accepted by the peer.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-03 15:48:46 +09:00
Greg Kroah-Hartman b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Kees Cook 31b102bb50 net: tipc: Convert timers to use timer_setup()
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Cc: tipc-discussion@lists.sourceforge.net
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-01 12:38:45 +09:00
Cong Wang e233df0157 tipc: fix a dangling pointer
tsk->group is set to grp earlier, but we forget to unset it
after grp is freed.

Fixes: 75da2163db ("tipc: introduce communication groups")
Reported-by: syzkaller bot
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-26 17:46:53 +09:00
Jon Maloy f65163fed0 tipc: eliminate KASAN warning
The following warning was reported by syzbot on Oct 24. 2017:
KASAN: slab-out-of-bounds Read in tipc_nametbl_lookup_dst_nodes

This is a harmless bug, but we still want to get rid of the warning,
so we swap the two conditions in question.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-26 09:38:34 +09:00
Jon Maloy 0d5fcebf3c tipc: refactor tipc_sk_timeout() function
The function tipc_sk_timeout() is more complex than necessary, and
even seems to contain an undetected bug. At one of the occurences
where we renew the timer we just order it with (HZ / 20), instead
of (jiffies + HZ / 20);

In this commit we clean up the function.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-22 02:36:35 +01:00
Jon Maloy cb4dc41eaa tipc: fix broken tipc_poll() function
In commit ae236fb208 ("tipc: receive group membership events via
member socket") we broke the tipc_poll() function by checking the
state of the receive queue before the call to poll_sock_wait(), while
relying that state afterwards, when it might have changed.

We restore this in this commit.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-21 12:27:05 +01:00
Dan Carpenter c75e427d93 tipc: checking for NULL instead of IS_ERR()
The tipc_alloc_conn() function never returns NULL, it returns error
pointers, so I have fixed the check.

Fixes: 14c04493cb ("tipc: add ability to order and receive topology events in driver")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-20 08:34:00 +01:00
Jon Maloy 36c0a9dfc6 tipc: fix rebasing error
In commit 2f487712b8 ("tipc: guarantee that group broadcast doesn't
bypass group unicast") there was introduced a last-minute rebasing
error that broke non-group communication.

We fix this here.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-16 21:28:36 +01:00
Jon Maloy 04d7b574b2 tipc: add multipoint-to-point flow control
We already have point-to-multipoint flow control within a group. But
we even need the opposite; -a scheme which can handle that potentially
hundreds of sources may try to send messages to the same destination
simultaneously without causing buffer overflow at the recipient. This
commit adds such a mechanism.

The algorithm works as follows:

- When a member detects a new, joining member, it initially set its
  state to JOINED and advertises a minimum window to the new member.
  This window is chosen so that the new member can send exactly one
  maximum sized message, or several smaller ones, to the recipient
  before it must stop and wait for an additional advertisement. This
  minimum window ADV_IDLE is set to 65 1kB blocks.

- When a member receives the first data message from a JOINED member,
  it changes the state of the latter to ACTIVE, and advertises a larger
  window ADV_ACTIVE = 12 x ADV_IDLE blocks to the sender, so it can
  continue sending with minimal disturbances to the data flow.

- The active members are kept in a dedicated linked list. Each time a
  message is received from an active member, it will be moved to the
  tail of that list. This way, we keep a record of which members have
  been most (tail) and least (head) recently active.

- There is a maximum number (16) of permitted simultaneous active
  senders per receiver. When this limit is reached, the receiver will
  not advertise anything immediately to a new sender, but instead put
  it in a PENDING state, and add it to a corresponding queue. At the
  same time, it will pick the least recently active member, send it an
  advertisement RECLAIM message, and set this member to state
  RECLAIMING.

- The reclaimee member has to respond with a REMIT message, meaning that
  it goes back to a send window of ADV_IDLE, and returns its unused
  advertised blocks beyond that value to the reclaiming member.

- When the reclaiming member receives the REMIT message, it unlinks
  the reclaimee from its active list, resets its state to JOINED, and
  notes that it is now back at ADV_IDLE advertised blocks to that
  member. If there are still unread data messages sent out by
  reclaimee before the REMIT, the member goes into an intermediate
  state REMITTED, where it stays until the said messages have been
  consumed.

- The returned advertised blocks can now be re-advertised to the
  pending member, which is now set to state ACTIVE and added to
  the active member list.

- To be proactive, i.e., to minimize the risk that any member will
  end up in the pending queue, we start reclaiming resources already
  when the number of active members exceeds 3/4 of the permitted
  maximum.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:01 -07:00
Jon Maloy a3bada7066 tipc: guarantee delivery of last broadcast before DOWN event
The following scenario is possible:
- A user sends a broadcast message, and thereafter immediately leaves
  the group.
- The LEAVE message, following a different path than the broadcast,
  arrives ahead of the broadcast, and the sending member is removed
  from the receiver's list.
- The broadcast message arrives, but is dropped because the sender
  now is unknown to the receipient.

We fix this by sequence numbering membership events, just like ordinary
unicast messages. Currently, when a JOIN is sent to a peer, it contains
a synchronization point, - the sequence number of the next sent
broadcast, in order to give the receiver a start synchronization point.
We now let even LEAVE messages contain such an "end synchronization"
point, so that the recipient can delay the removal of the sending member
until it knows that all messages have been received.

The received synchronization points are added as sequence numbers to the
generated membership events, making it possible to handle them almost
the same way as regular unicasts in the receiving filter function. In
particular, a DOWN event with a too high sequence number will be kept
in the reordering queue until the missing broadcast(s) arrive and have
been delivered.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:01 -07:00
Jon Maloy 399574d419 tipc: guarantee delivery of UP event before first broadcast
The following scenario is possible:
- A user joins a group, and immediately sends out a broadcast message
  to its members.
- The broadcast message, following a different data path than the
  initial JOIN message sent out during the joining procedure, arrives
  to a receiver before the latter..
- The receiver drops the message, since it is not ready to accept any
  messages until the JOIN has arrived.

We avoid this by treating group protocol JOIN messages like unicast
messages.
- We let them pass through the recipient's multicast input queue, just
  like ordinary unicasts.
- We force the first following broadacst to be sent as replicated
  unicast and being acknowledged by the recipient before accepting
  any more broadcast transmissions.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:01 -07:00
Jon Maloy 2f487712b8 tipc: guarantee that group broadcast doesn't bypass group unicast
We need a mechanism guaranteeing that group unicasts sent out from a
socket are not bypassed by later sent broadcasts from the same socket.
We do this as follows:

- Each time a unicast is sent, we set a the broadcast method for the
  socket to "replicast" and "mandatory". This forces the first
  subsequent broadcast message to follow the same network and data path
  as the preceding unicast to a destination, hence preventing it from
  overtaking the latter.

- In order to make the 'same data path' statement above true, we let
  group unicasts pass through the multicast link input queue, instead
  of as previously through the unicast link input queue.

- In the first broadcast following a unicast, we set a new header flag,
  requiring all recipients to immediately acknowledge its reception.

- During the period before all the expected acknowledges are received,
  the socket refuses to accept any more broadcast attempts, i.e., by
  blocking or returning EAGAIN. This period should typically not be
  longer than a few microseconds.

- When all acknowledges have been received, the sending socket will
  open up for subsequent broadcasts, this time giving the link layer
  freedom to itself select the best transmission method.

- The forced and/or abrupt transmission method changes described above
  may lead to broadcasts arriving out of order to the recipients. We
  remedy this by introducing code that checks and if necessary
  re-orders such messages at the receiving end.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:01 -07:00
Jon Maloy b87a5ea31c tipc: guarantee group unicast doesn't bypass group broadcast
Group unicast messages don't follow the same path as broadcast messages,
and there is a high risk that unicasts sent from a socket might bypass
previously sent broadcasts from the same socket.

We fix this by letting all unicast messages carry the sequence number of
the next sent broadcast from the same node, but without updating this
number at the receiver. This way, a receiver can check and if necessary
re-order such messages before they are added to the socket receive buffer.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:01 -07:00
Jon Maloy 5b8dddb637 tipc: introduce group multicast messaging
The previously introduced message transport to all group members is
based on the tipc multicast service, but is logically a broadcast
service within the group, and that is what we call it.

We now add functionality for sending messages to all group members
having a certain identity. Correspondingly, we call this feature 'group
multicast'. The service is using unicast when only one destination is
found, otherwise it will use the bearer broadcast service to transfer
the messages. In the latter case, the receiving members filter arriving
messages by looking at the intended destination instance. If there is
no match, the message will be dropped, while still being considered
received and read as seen by the flow control mechanism.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:01 -07:00
Jon Maloy ee106d7f94 tipc: introduce group anycast messaging
In this commit, we make it possible to send connectionless unicast
messages to any member corresponding to the given member identity,
when there is more than one such member. The sender must use a
TIPC_ADDR_NAME address to achieve this effect.

We also perform load balancing between the destinations, i.e., we
primarily select one which has advertised sufficient send window
to not cause a block/EAGAIN delay, if any. This mechanism is
overlayed on the always present round-robin selection.

Anycast messages are subject to the same start synchronization
and flow control mechanism as group broadcast messages.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:00 -07:00
Jon Maloy 27bd9ec027 tipc: introduce group unicast messaging
We now make it possible to send connectionless unicast messages
within a communication group. To send a message, the sender can use
either a direct port address, aka port identity, or an indirect port
name to be looked up.

This type of messages are subject to the same start synchronization
and flow control mechanism as group broadcast messages.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:00 -07:00
Jon Maloy b7d4263551 tipc: introduce flow control for group broadcast messages
We introduce an end-to-end flow control mechanism for group broadcast
messages. This ensures that no messages are ever lost because of
destination receive buffer overflow, with minimal impact on performance.
For now, the algorithm is based on the assumption that there is only one
active transmitter at any moment in time.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:00 -07:00
Jon Maloy ae236fb208 tipc: receive group membership events via member socket
Like with any other service, group members' availability can be
subscribed for by connecting to be topology server. However, because
the events arrive via a different socket than the member socket, there
is a real risk that membership events my arrive out of synch with the
actual JOIN/LEAVE action. I.e., it is possible to receive the first
messages from a new member before the corresponding JOIN event arrives,
just as it is possible to receive the last messages from a leaving
member after the LEAVE event has already been received.

Since each member socket is internally also subscribing for membership
events, we now fix this problem by passing those events on to the user
via the member socket. We leverage the already present member synch-
ronization protocol to guarantee correct message/event order. An event
is delivered to the user as an empty message where the two source
addresses identify the new/lost member. Furthermore, we set the MSG_OOB
bit in the message flags to mark it as an event. If the event is an
indication about a member loss we also set the MSG_EOR bit, so it can
be distinguished from a member addition event.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:00 -07:00
Jon Maloy 31c82a2d9d tipc: add second source address to recvmsg()/recvfrom()
With group communication, it becomes important for a message receiver to
identify not only from which socket (identfied by a node:port tuple) the
message was sent, but also the logical identity (type:instance) of the
sending member.

We fix this by adding a second instance of struct sockaddr_tipc to the
source address area when a message is read. The extra address struct
is filled in with data found in the received message header (type,) and
in the local member representation struct (instance.)

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:00 -07:00
Jon Maloy 75da2163db tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.

We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.

A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.

Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:00 -07:00
Jon Maloy a80ae5306a tipc: improve destination linked list
We often see a need for a linked list of destination identities,
sometimes containing a port number, sometimes a node identity, and
sometimes both. The currently defined struct u32_list is not generic
enough to cover all cases, so we extend it to contain two u32 integers
and rename it to struct tipc_dest_list.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:00 -07:00
Jon Maloy f70d37b796 tipc: add new function for sending multiple small messages
We see an increasing need to send multiple single-buffer messages
of TIPC_SYSTEM_IMPORTANCE to different individual destination nodes.
Instead of looping over the send queue and sending each buffer
individually, as we do now, we add a new help function
tipc_node_distr_xmit() to do this.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:00 -07:00
Jon Maloy 64ac5f5977 tipc: refactor function filter_rcv()
In the following commits we will need to handle multiple incoming and
rejected/returned buffers in the function socket.c::filter_rcv().
As a preparation for this, we generalize the function by handling
buffer queues instead of individual buffers. We also introduce a
help function tipc_skb_reject(), and rename filter_rcv() to
tipc_sk_filter_rcv() in line with other functions in socket.c.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:00 -07:00
Jon Maloy 38077b8ef8 tipc: add ability to obtain node availability status from other files
In the coming commits, functions at the socket level will need the
ability to read the availability status of a given node. We therefore
introduce a new function for this purpose, while renaming the existing
static function currently having the wanted name.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:00 -07:00
Jon Maloy 23998835be tipc: improve address sanity check in tipc_connect()
The address given to tipc_connect() is not completely sanity checked,
under the assumption that this will be done later in the function
__tipc_sendmsg() when the address is used there.

However, the latter functon will in the next commits serve as caller
to several other send functions, so we want to move the corresponding
sanity check there to the beginning of that function, before we possibly
need to grab the address stored by tipc_connect(). We must therefore
be able to trust that this address already has been thoroughly checked.

We do this in this commit.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:00 -07:00
Jon Maloy 14c04493cb tipc: add ability to order and receive topology events in driver
As preparation for introducing communication groups, we add the ability
to issue topology subscriptions and receive topology events from kernel
space. This will make it possible for group member sockets to keep track
of other group members.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 08:46:00 -07:00
Jon Maloy a9e2971b8c tipc: Unclone message at secondary destination lookup
When a bundling message is received, the function tipc_link_input()
calls function tipc_msg_extract() to unbundle all inner messages of
the bundling message before adding them to input queue.

The function tipc_msg_extract() just clones all inner skb for all
inner messagges from the bundling skb. This means that the skb
headroom of an inner message overlaps with the data part of the
preceding message in the bundle.

If the message in question is a name addressed message, it may be
subject to a secondary destination lookup, and eventually be sent out
on one of the interfaces again. But, since what is perceived as headroom
by the device driver in reality is the last bytes of the preceding
message in the bundle, the latter will be overwritten by the MAC
addresses of the L2 header. If the preceding message has not yet been
consumed by the user, it will evenually be delivered with corrupted
contents.

This commit fixes this by uncloning all messages passing through the
function tipc_msg_lookup_dest(), hence ensuring that the headroom
is always valid when the message is passed on.

Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-08 21:13:23 -07:00
Jon Maloy 3382605fd8 tipc: correct initialization of skb list
We change the initialization of the skb transmit buffer queues
in the functions tipc_bcast_xmit() and tipc_rcast_xmit() to also
initialize their spinlocks. This is needed because we may, during
error conditions, need to call skb_queue_purge() on those queues
further down the stack.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-08 21:13:23 -07:00
Parthasarathy Bhuvaragan aad06212d3 tipc: use only positive error codes in messages
In commit e3a77561e7 ("tipc: split up function tipc_msg_eval()"),
we have updated the function tipc_msg_lookup_dest() to set the error
codes to negative values at destination lookup failures. Thus when
the function sets the error code to -TIPC_ERR_NO_NAME, its inserted
into the 4 bit error field of the message header as 0xf instead of
TIPC_ERR_NO_NAME (1). The value 0xf is an unknown error code.

In this commit, we set only positive error code.

Fixes: e3a77561e7 ("tipc: split up function tipc_msg_eval()")
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-01 04:03:35 +01:00
Kleber Sacilotto de Souza 8e0deed924 tipc: remove unnecessary call to dev_net()
The net device is already stored in the 'net' variable, so no need to call
dev_net() again.

Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-06 21:25:52 -07:00
David S. Miller 6026e043d0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01 17:42:05 -07:00
Parthasarathy Bhuvaragan d55c60eba0 tipc: permit bond slave as bearer
For a bond slave device as a tipc bearer, the dev represents the bond
interface and orig_dev represents the slave in tipc_l2_rcv_msg().
Since we decode the tipc_ptr from bonding device (dev), we fail to
find the bearer and thus tipc links are not established.

In this commit, we register the tipc protocol callback per device and
look for tipc bearer from both the devices.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-29 15:07:33 -07:00
Parthasarathy Bhuvaragan 991ca84daa tipc: context imbalance at node read unlock
If we fail to find a valid bearer in tipc_node_get_linkname(),
node_read_unlock() is called without holding the node read lock.

This commit fixes this error.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 21:54:34 -07:00
Parthasarathy Bhuvaragan 60d1d93664 tipc: reassign pointers after skb reallocation / linearization
In tipc_msg_reverse(), we assign skb attributes to local pointers
in stack at startup. This is followed by skb_linearize() and for
cloned buffers we perform skb relocation using pskb_expand_head().
Both these methods may update the skb attributes and thus making
the pointers incorrect.

In this commit, we fix this error by ensuring that the pointers
are re-assigned after any of these skb operations.

Fixes: 29042e19f2 ("tipc: let function tipc_msg_reverse() expand header
when needed")
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 21:54:34 -07:00
Parthasarathy Bhuvaragan 27163138b4 tipc: perform skb_linearize() before parsing the inner header
In tipc_rcv(), we linearize only the header and usually the packets
are consumed as the nodes permit direct reception. However, if the
skb contains tunnelled message due to fail over or synchronization
we parse it in tipc_node_check_state() without performing
linearization. This will cause link disturbances if the skb was
non linear.

In this commit, we perform linearization for the above messages.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 21:54:34 -07:00
Bob Peterson 6c7e983b22 tipc: Fix tipc_sk_reinit handling of -EAGAIN
In 9dbbfb0ab6 function tipc_sk_reinit
had additional logic added to loop in the event that function
rhashtable_walk_next() returned -EAGAIN. No worries.

However, if rhashtable_walk_start returns -EAGAIN, it does "continue",
and therefore skips the call to rhashtable_walk_stop(). That has
the effect of calling rcu_read_lock() without its paired call to
rcu_read_unlock(). Since rcu_read_lock() may be nested, the problem
may not be apparent for a while, especially since resize events may
be rare. But the comments to rhashtable_walk_start() state:

 * ...Note that we take the RCU lock in all
 * cases including when we return an error.  So you must always call
 * rhashtable_walk_stop to clean up.

This patch replaces the continue with a goto and label to ensure a
matching call to rhashtable_walk_stop().

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-24 14:02:26 -07:00
Arvind Yadav 042a90106b net: tipc: constify genl_ops
genl_ops are not supposed to change at runtime. All functions
working with genl_ops provided by <net/genetlink.h> work with
const genl_ops. So mark the non-const structs as const.

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-23 22:31:38 -07:00
Ying Xue fd849b7c41 tipc: fix a race condition of releasing subscriber object
No matter whether a request is inserted into workqueue as a work item
to cancel a subscription or to delete a subscription's subscriber
asynchronously, the work items may be executed in different workers.
As a result, it doesn't mean that one request which is raised prior to
another request is definitely handled before the latter. By contrast,
if the latter request is executed before the former request, below
error may happen:

[  656.183644] BUG: spinlock bad magic on CPU#0, kworker/u8:0/12117
[  656.184487] general protection fault: 0000 [#1] SMP
[  656.185160] Modules linked in: tipc ip6_udp_tunnel udp_tunnel 9pnet_virtio 9p 9pnet virtio_net virtio_pci virtio_ring virtio [last unloaded: ip6_udp_tunnel]
[  656.187003] CPU: 0 PID: 12117 Comm: kworker/u8:0 Not tainted 4.11.0-rc7+ #6
[  656.187920] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  656.188690] Workqueue: tipc_rcv tipc_recv_work [tipc]
[  656.189371] task: ffff88003f5cec40 task.stack: ffffc90004448000
[  656.190157] RIP: 0010:spin_bug+0xdd/0xf0
[  656.190678] RSP: 0018:ffffc9000444bcb8 EFLAGS: 00010202
[  656.191375] RAX: 0000000000000034 RBX: ffff88003f8d1388 RCX: 0000000000000000
[  656.192321] RDX: ffff88003ba13708 RSI: ffff88003ba0cd08 RDI: ffff88003ba0cd08
[  656.193265] RBP: ffffc9000444bcd0 R08: 0000000000000030 R09: 000000006b6b6b6b
[  656.194208] R10: ffff8800bde3e000 R11: 00000000000001b4 R12: 6b6b6b6b6b6b6b6b
[  656.195157] R13: ffffffff81a3ca64 R14: ffff88003f8d1388 R15: ffff88003f8d13a0
[  656.196101] FS:  0000000000000000(0000) GS:ffff88003ba00000(0000) knlGS:0000000000000000
[  656.197172] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  656.197935] CR2: 00007f0b3d2e6000 CR3: 000000003ef9e000 CR4: 00000000000006f0
[  656.198873] Call Trace:
[  656.199210]  do_raw_spin_lock+0x66/0xa0
[  656.199735]  _raw_spin_lock_bh+0x19/0x20
[  656.200258]  tipc_subscrb_subscrp_delete+0x28/0xf0 [tipc]
[  656.200990]  tipc_subscrb_rcv_cb+0x45/0x260 [tipc]
[  656.201632]  tipc_receive_from_sock+0xaf/0x100 [tipc]
[  656.202299]  tipc_recv_work+0x2b/0x60 [tipc]
[  656.202872]  process_one_work+0x157/0x420
[  656.203404]  worker_thread+0x69/0x4c0
[  656.203898]  kthread+0x138/0x170
[  656.204328]  ? process_one_work+0x420/0x420
[  656.204889]  ? kthread_create_on_node+0x40/0x40
[  656.205527]  ret_from_fork+0x29/0x40
[  656.206012] Code: 48 8b 0c 25 00 c5 00 00 48 c7 c7 f0 24 a3 81 48 81 c1 f0 05 00 00 65 8b 15 61 ef f5 7e e8 9a 4c 09 00 4d 85 e4 44 8b 4b 08 74 92 <45> 8b 84 24 40 04 00 00 49 8d 8c 24 f0 05 00 00 eb 8d 90 0f 1f
[  656.208504] RIP: spin_bug+0xdd/0xf0 RSP: ffffc9000444bcb8
[  656.209798] ---[ end trace e2a800e6eb0770be ]---

In above scenario, the request of deleting subscriber was performed
earlier than the request of canceling a subscription although the
latter was issued before the former, which means tipc_subscrb_delete()
was called before tipc_subscrp_cancel(). As a result, when
tipc_subscrb_subscrp_delete() called by tipc_subscrp_cancel() was
executed to cancel a subscription, the subscription's subscriber
refcnt had been decreased to 1. After tipc_subscrp_delete() where
the subscriber was freed because its refcnt was decremented to zero,
but the subscriber's lock had to be released, as a consequence, panic
happened.

By contrast, if we increase subscriber's refcnt before
tipc_subscrb_subscrp_delete() is called in tipc_subscrp_cancel(),
the panic issue can be avoided.

Fixes: d094c4d5f5 ("tipc: add subscription refcount to avoid invalid delete")
Reported-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-22 14:25:02 -07:00
Parthasarathy Bhuvaragan 458be024ef tipc: remove subscription references only for pending timers
In commit, 139bb36f75 ("tipc: advance the time of deleting
subscription from subscriber->subscrp_list"), we delete the
subscription from the subscribers list and from nametable
unconditionally. This leads to the following bug if the timer
running tipc_subscrp_timeout() in another CPU accesses the
subscription list after the subscription delete request.

[39.570] general protection fault: 0000 [#1] SMP
::
[39.574] task: ffffffff81c10540 task.stack: ffffffff81c00000
[39.575] RIP: 0010:tipc_subscrp_timeout+0x32/0x80 [tipc]
[39.576] RSP: 0018:ffff88003ba03e90 EFLAGS: 00010282
[39.576] RAX: dead000000000200 RBX: ffff88003f0f3600 RCX: 0000000000000101
[39.577] RDX: dead000000000100 RSI: 0000000000000201 RDI: ffff88003f0d7948
[39.578] RBP: ffff88003ba03ea0 R08: 0000000000000001 R09: ffff88003ba03ef8
[39.579] R10: 000000000000014f R11: 0000000000000000 R12: ffff88003f0d7948
[39.580] R13: ffff88003f0f3618 R14: ffffffffa006c250 R15: ffff88003f0f3600
[39.581] FS:  0000000000000000(0000) GS:ffff88003ba00000(0000) knlGS:0000000000000000
[39.582] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[39.583] CR2: 00007f831c6e0714 CR3: 000000003d3b0000 CR4: 00000000000006f0
[39.584] Call Trace:
[39.584]  <IRQ>
[39.585]  call_timer_fn+0x3d/0x180
[39.585]  ? tipc_subscrb_rcv_cb+0x260/0x260 [tipc]
[39.586]  run_timer_softirq+0x168/0x1f0
[39.586]  ? sched_clock_cpu+0x16/0xc0
[39.587]  __do_softirq+0x9b/0x2de
[39.587]  irq_exit+0x60/0x70
[39.588]  smp_apic_timer_interrupt+0x3d/0x50
[39.588]  apic_timer_interrupt+0x86/0x90
[39.589] RIP: 0010:default_idle+0x20/0xf0
[39.589] RSP: 0018:ffffffff81c03e58 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[39.590] RAX: 0000000000000000 RBX: ffffffff81c10540 RCX: 0000000000000000
[39.591] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[39.592] RBP: ffffffff81c03e68 R08: 0000000000000000 R09: 0000000000000000
[39.593] R10: ffffc90001cbbe00 R11: 0000000000000000 R12: 0000000000000000
[39.594] R13: ffffffff81c10540 R14: 0000000000000000 R15: 0000000000000000
[39.595]  </IRQ>
::
[39.603] RIP: tipc_subscrp_timeout+0x32/0x80 [tipc] RSP: ffff88003ba03e90
[39.604] ---[ end trace 79ce94b7216cb459 ]---

Fixes: 139bb36f75 ("tipc: advance the time of deleting subscription from subscriber->subscrp_list")
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-22 14:25:02 -07:00
David S. Miller e2a7c34fb2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-08-21 17:06:42 -07:00
Jon Paul Maloy 40501f90ed tipc: don't reset stale broadcast send link
When the broadcast send link after 100 attempts has failed to
transfer a packet to all peers, we consider it stale, and reset
it. Thereafter it needs to re-synchronize with the peers, something
currently done by just resetting and re-establishing all links to
all peers. This has turned out to be overkill, with potentially
unwanted consequences for the remaining cluster.

A closer analysis reveals that this can be done much simpler. When
this kind of failure happens, for reasons that may lie outside the
TIPC protocol, it is typically only one peer which is failing to
receive and acknowledge packets. It is hence sufficient to identify
and reset the links only to that peer to resolve the situation, without
having to reset the broadcast link at all. This solution entails a much
lower risk of negative consequences for the own node as well as for
the overall cluster.

We implement this change in this commit.

Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-21 13:37:45 -07:00
Eric Dumazet 5bfd37b4de tipc: fix use-after-free
syszkaller reported use-after-free in tipc [1]

When msg->rep skb is freed, set the pointer to NULL,
so that caller does not free it again.

[1]

==================================================================
BUG: KASAN: use-after-free in skb_push+0xd4/0xe0 net/core/skbuff.c:1466
Read of size 8 at addr ffff8801c6e71e90 by task syz-executor5/4115

CPU: 1 PID: 4115 Comm: syz-executor5 Not tainted 4.13.0-rc4+ #32
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:52
 print_address_description+0x73/0x250 mm/kasan/report.c:252
 kasan_report_error mm/kasan/report.c:351 [inline]
 kasan_report+0x24e/0x340 mm/kasan/report.c:409
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
 skb_push+0xd4/0xe0 net/core/skbuff.c:1466
 tipc_nl_compat_recv+0x833/0x18f0 net/tipc/netlink_compat.c:1209
 genl_family_rcv_msg+0x7b7/0xfb0 net/netlink/genetlink.c:598
 genl_rcv_msg+0xb2/0x140 net/netlink/genetlink.c:623
 netlink_rcv_skb+0x216/0x440 net/netlink/af_netlink.c:2397
 genl_rcv+0x28/0x40 net/netlink/genetlink.c:634
 netlink_unicast_kernel net/netlink/af_netlink.c:1265 [inline]
 netlink_unicast+0x4e8/0x6f0 net/netlink/af_netlink.c:1291
 netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1854
 sock_sendmsg_nosec net/socket.c:633 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:643
 sock_write_iter+0x31a/0x5d0 net/socket.c:898
 call_write_iter include/linux/fs.h:1743 [inline]
 new_sync_write fs/read_write.c:457 [inline]
 __vfs_write+0x684/0x970 fs/read_write.c:470
 vfs_write+0x189/0x510 fs/read_write.c:518
 SYSC_write fs/read_write.c:565 [inline]
 SyS_write+0xef/0x220 fs/read_write.c:557
 entry_SYSCALL_64_fastpath+0x1f/0xbe
RIP: 0033:0x4512e9
RSP: 002b:00007f3bc8184c08 EFLAGS: 00000216 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000718000 RCX: 00000000004512e9
RDX: 0000000000000020 RSI: 0000000020fdb000 RDI: 0000000000000006
RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000216 R12: 00000000004b5e76
R13: 00007f3bc8184b48 R14: 00000000004b5e86 R15: 0000000000000000

Allocated by task 4115:
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:447
 set_track mm/kasan/kasan.c:459 [inline]
 kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
 kmem_cache_alloc_node+0x13d/0x750 mm/slab.c:3651
 __alloc_skb+0xf1/0x740 net/core/skbuff.c:219
 alloc_skb include/linux/skbuff.h:903 [inline]
 tipc_tlv_alloc+0x26/0xb0 net/tipc/netlink_compat.c:148
 tipc_nl_compat_dumpit+0xf2/0x3c0 net/tipc/netlink_compat.c:248
 tipc_nl_compat_handle net/tipc/netlink_compat.c:1130 [inline]
 tipc_nl_compat_recv+0x756/0x18f0 net/tipc/netlink_compat.c:1199
 genl_family_rcv_msg+0x7b7/0xfb0 net/netlink/genetlink.c:598
 genl_rcv_msg+0xb2/0x140 net/netlink/genetlink.c:623
 netlink_rcv_skb+0x216/0x440 net/netlink/af_netlink.c:2397
 genl_rcv+0x28/0x40 net/netlink/genetlink.c:634
 netlink_unicast_kernel net/netlink/af_netlink.c:1265 [inline]
 netlink_unicast+0x4e8/0x6f0 net/netlink/af_netlink.c:1291
 netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1854
 sock_sendmsg_nosec net/socket.c:633 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:643
 sock_write_iter+0x31a/0x5d0 net/socket.c:898
 call_write_iter include/linux/fs.h:1743 [inline]
 new_sync_write fs/read_write.c:457 [inline]
 __vfs_write+0x684/0x970 fs/read_write.c:470
 vfs_write+0x189/0x510 fs/read_write.c:518
 SYSC_write fs/read_write.c:565 [inline]
 SyS_write+0xef/0x220 fs/read_write.c:557
 entry_SYSCALL_64_fastpath+0x1f/0xbe

Freed by task 4115:
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:447
 set_track mm/kasan/kasan.c:459 [inline]
 kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
 __cache_free mm/slab.c:3503 [inline]
 kmem_cache_free+0x77/0x280 mm/slab.c:3763
 kfree_skbmem+0x1a1/0x1d0 net/core/skbuff.c:622
 __kfree_skb net/core/skbuff.c:682 [inline]
 kfree_skb+0x165/0x4c0 net/core/skbuff.c:699
 tipc_nl_compat_dumpit+0x36a/0x3c0 net/tipc/netlink_compat.c:260
 tipc_nl_compat_handle net/tipc/netlink_compat.c:1130 [inline]
 tipc_nl_compat_recv+0x756/0x18f0 net/tipc/netlink_compat.c:1199
 genl_family_rcv_msg+0x7b7/0xfb0 net/netlink/genetlink.c:598
 genl_rcv_msg+0xb2/0x140 net/netlink/genetlink.c:623
 netlink_rcv_skb+0x216/0x440 net/netlink/af_netlink.c:2397
 genl_rcv+0x28/0x40 net/netlink/genetlink.c:634
 netlink_unicast_kernel net/netlink/af_netlink.c:1265 [inline]
 netlink_unicast+0x4e8/0x6f0 net/netlink/af_netlink.c:1291
 netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1854
 sock_sendmsg_nosec net/socket.c:633 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:643
 sock_write_iter+0x31a/0x5d0 net/socket.c:898
 call_write_iter include/linux/fs.h:1743 [inline]
 new_sync_write fs/read_write.c:457 [inline]
 __vfs_write+0x684/0x970 fs/read_write.c:470
 vfs_write+0x189/0x510 fs/read_write.c:518
 SYSC_write fs/read_write.c:565 [inline]
 SyS_write+0xef/0x220 fs/read_write.c:557
 entry_SYSCALL_64_fastpath+0x1f/0xbe

The buggy address belongs to the object at ffff8801c6e71dc0
 which belongs to the cache skbuff_head_cache of size 224
The buggy address is located 208 bytes inside of
 224-byte region [ffff8801c6e71dc0, ffff8801c6e71ea0)
The buggy address belongs to the page:
page:ffffea00071b9c40 count:1 mapcount:0 mapping:ffff8801c6e71000 index:0x0
flags: 0x200000000000100(slab)
raw: 0200000000000100 ffff8801c6e71000 0000000000000000 000000010000000c
raw: ffffea0007224a20 ffff8801d98caf48 ffff8801d9e79040 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff8801c6e71d80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
 ffff8801c6e71e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff8801c6e71e80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
                         ^
 ffff8801c6e71f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff8801c6e71f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov  <dvyukov@google.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-18 15:59:29 -07:00
Jon Paul Maloy 59a361bc6f tipc: avoid inheriting msg_non_seq flag when message is returned
In the function msg_reverse(), we reverse the header while trying to
reuse the original buffer whenever possible. Those rejected/returned
messages are always transmitted as unicast, but the msg_non_seq field
is not explicitly set to zero as it should be.

We have seen cases where multicast senders set the message type to
"NOT dest_droppable", meaning that a multicast message shorter than
one MTU will be returned, e.g., during receive buffer overflow, by
reusing the original buffer. This has the effect that even the
'msg_non_seq' field is inadvertently inherited by the rejected message,
although it is now sent as a unicast message. This again leads the
receiving unicast link endpoint to steer the packet toward the broadcast
link receive function, where it is dropped. The affected unicast link is
thereafter (after 100 failed retransmissions) declared 'stale' and
reset.

We fix this by unconditionally setting the 'msg_non_seq' flag to zero
for all rejected/returned messages.

Reported-by: Canh Duc Luu <canh.d.luu@dektech.com.au>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-14 11:20:36 -07:00
Jon Paul Maloy fed5f5718c tipc: accept PACKET_MULTICAST packets
On L2 bearers, the TIPC broadcast function is sending out packets using
the corresponding L2 broadcast address. At reception, we filter such
packets under the assumption that they will also be delivered as
broadcast packets.

This assumption doesn't always hold true. Under high load, we have seen
that a switch may convert the destination address and deliver the packet
as a PACKET_MULTICAST, something leading to inadvertently dropped
packets and a stale and reset broadcast link.

We fix this by extending the reception filtering to accept packets of
type PACKET_MULTICAST.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-14 11:19:25 -07:00
Jon Paul Maloy ed43594aed tipc: remove premature ESTABLISH FSM event at link synchronization
When a link between two nodes come up, both endpoints will initially
send out a STATE message to the peer, to increase the probability that
the peer endpoint also is up when the first traffic message arrives.
Thereafter, if the establishing link is the second link between two
nodes, this first "traffic" message is a TUNNEL_PROTOCOL/SYNCH message,
helping the peer to perform initial synchronization between the two
links.

However, the initial STATE message may be lost, in which case the SYNCH
message will be the first one arriving at the peer. This should also
work, as the SYNCH message itself will be used to take up the link
endpoint before  initializing synchronization.

Unfortunately the code for this case is broken. Currently, the link is
brought up through a tipc_link_fsm_evt(ESTABLISHED) when a SYNCH
arrives, whereupon __tipc_node_link_up() is called to distribute the
link slots and take the link into traffic. But, __tipc_node_link_up() is
itself starting with a test for whether the link is up, and if true,
returns without action. Clearly, the tipc_link_fsm_evt(ESTABLISHED) call
is unnecessary, since tipc_node_link_up() is itself issuing such an
event, but also harmful, since it inhibits tipc_node_link_up() to
perform the test of its tasks, and the link endpoint in question hence
is never taken into traffic.

This problem has been exposed when we set up dual links between pre-
and post-4.4 kernels, because the former ones don't send out the
initial STATE message described above.

We fix this by removing the unnecessary event call.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 22:38:06 -07:00
Reshetova, Elena 41c6d650f6 net: convert sock.sk_refcnt from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

This patch uses refcount_inc_not_zero() instead of
atomic_inc_not_zero_hint() due to absense of a _hint()
version of refcount API. If the hint() version must
be used, we might need to revisit API.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01 07:39:08 -07:00
Jia-Ju Bai 343eba69c6 net: tipc: Fix a sleep-in-atomic bug in tipc_msg_reverse
The kernel may sleep under a rcu read lock in tipc_msg_reverse, and the
function call path is:
tipc_l2_rcv_msg (acquire the lock by rcu_read_lock)
  tipc_rcv
    tipc_sk_rcv
      tipc_msg_reverse
        pskb_expand_head(GFP_KERNEL) --> may sleep
tipc_node_broadcast
  tipc_node_xmit_skb
    tipc_node_xmit
      tipc_sk_rcv
        tipc_msg_reverse
          pskb_expand_head(GFP_KERNEL) --> may sleep

To fix it, "GFP_KERNEL" is replaced with "GFP_ATOMIC".

Signed-off-by: Jia-Ju Bai <baijiaju1990@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-10 18:20:38 -04:00
Jon Paul Maloy 844cf763fb tipc: make macro tipc_wait_for_cond() smp safe
The macro tipc_wait_for_cond() is embedding the macro sk_wait_event()
to fulfil its task. The latter, in turn, is evaluating the stated
condition outside the socket lock context. This is problematic if
the condition is accessing non-trivial data structures which may be
altered by incoming interrupts, as is the case with the cong_links()
linked list, used by socket to keep track of the current set of
congested links. We sometimes see crashes when this list is accessed
by a condition function at the same time as a SOCK_WAKEUP interrupt
is removing an element from the list.

We fix this by expanding selected parts of sk_wait_event() into the
outer macro, while ensuring that all evaluations of a given condition
are performed under socket lock protection.

Fixes: commit 365ad353c2 ("tipc: reduce risk of user starvation during link congestion")
Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-11 22:19:30 -04:00
Linus Torvalds 8d65b08deb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Millar:
 "Here are some highlights from the 2065 networking commits that
  happened this development cycle:

   1) XDP support for IXGBE (John Fastabend) and thunderx (Sunil Kowuri)

   2) Add a generic XDP driver, so that anyone can test XDP even if they
      lack a networking device whose driver has explicit XDP support
      (me).

   3) Sparc64 now has an eBPF JIT too (me)

   4) Add a BPF program testing framework via BPF_PROG_TEST_RUN (Alexei
      Starovoitov)

   5) Make netfitler network namespace teardown less expensive (Florian
      Westphal)

   6) Add symmetric hashing support to nft_hash (Laura Garcia Liebana)

   7) Implement NAPI and GRO in netvsc driver (Stephen Hemminger)

   8) Support TC flower offload statistics in mlxsw (Arkadi Sharshevsky)

   9) Multiqueue support in stmmac driver (Joao Pinto)

  10) Remove TCP timewait recycling, it never really could possibly work
      well in the real world and timestamp randomization really zaps any
      hint of usability this feature had (Soheil Hassas Yeganeh)

  11) Support level3 vs level4 ECMP route hashing in ipv4 (Nikolay
      Aleksandrov)

  12) Add socket busy poll support to epoll (Sridhar Samudrala)

  13) Netlink extended ACK support (Johannes Berg, Pablo Neira Ayuso,
      and several others)

  14) IPSEC hw offload infrastructure (Steffen Klassert)"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2065 commits)
  tipc: refactor function tipc_sk_recv_stream()
  tipc: refactor function tipc_sk_recvmsg()
  net: thunderx: Optimize page recycling for XDP
  net: thunderx: Support for XDP header adjustment
  net: thunderx: Add support for XDP_TX
  net: thunderx: Add support for XDP_DROP
  net: thunderx: Add basic XDP support
  net: thunderx: Cleanup receive buffer allocation
  net: thunderx: Optimize CQE_TX handling
  net: thunderx: Optimize RBDR descriptor handling
  net: thunderx: Support for page recycling
  ipx: call ipxitf_put() in ioctl error path
  net: sched: add helpers to handle extended actions
  qed*: Fix issues in the ptp filter config implementation.
  qede: Fix concurrency issue in PTP Tx path processing.
  stmmac: Add support for SIMATIC IOT2000 platform
  net: hns: fix ethtool_get_strings overflow in hns driver
  tcp: fix wraparound issue in tcp_lp
  bpf, arm64: fix jit branch offset related to ldimm64
  bpf, arm64: implement jiting of BPF_XADD
  ...
2017-05-02 16:40:27 -07:00
Jon Paul Maloy ec8a09fbbe tipc: refactor function tipc_sk_recv_stream()
We try to make this function more readable by improving variable names
and comments, using more stack variables, and doing some smaller changes
to the logics. We also rename the function to make it consistent with
naming conventions used elsewhere in the code.

Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-02 15:56:54 -04:00
Jon Paul Maloy e9f8b10101 tipc: refactor function tipc_sk_recvmsg()
We try to make this function more readable by improving variable names
and comments, plus some minor changes to the logics.

Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-02 15:56:54 -04:00
Parthasarathy Bhuvaragan c1be775628 tipc: close the connection if protocol messages contain errors
When a socket is shutting down, we notify the peer node about the
connection termination by reusing an incoming message if possible.
If the last received message was a connection acknowledgment
message, we reverse this message and set the error code to
TIPC_ERR_NO_PORT and send it to peer.

In tipc_sk_proto_rcv(), we never check for message errors while
processing the connection acknowledgment or probe messages. Thus
this message performs the usual flow control accounting and leaves
the session hanging.

In this commit, we terminate the connection when we receive such
error messages.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-28 12:20:42 -04:00
Parthasarathy Bhuvaragan 4e0df4951e tipc: improve error validations for sockets in CONNECTING state
Until now, the checks for sockets in CONNECTING state was based on
the assumption that the incoming message was always from the
peer's accepted data socket.

However an application using a non-blocking socket sends an implicit
connect, this socket which is in CONNECTING state can receive error
messages from the peer's listening socket. As we discard these
messages, the application socket hangs as there due to inactivity.
In addition to this, there are other places where we process errors
but do not notify the user.

In this commit, we process such incoming error messages and notify
our users about them using sk_state_change().

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-28 12:20:42 -04:00
Parthasarathy Bhuvaragan 42b531de17 tipc: Fix missing connection request handling
In filter_connect, we use waitqueue_active() to check for any
connections to wakeup. But waitqueue_active() is missing memory
barriers while accessing the critical sections, leading to
inconsistent results.

In this commit, we replace this with an SMP safe wq_has_sleeper()
using the generic socket callback sk_data_ready().

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-28 12:20:42 -04:00
David S. Miller b1513c3531 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26 22:39:08 -04:00
Parthasarathy Bhuvaragan 05ff837897 tipc: fix socket flow control accounting error at tipc_recv_stream
Until now in tipc_recv_stream(), we update the received
unacknowledged bytes based on a stack variable and not based on the
actual message size.
If the user buffer passed at tipc_recv_stream() is smaller than the
received skb, the size variable in stack differs from the actual
message size in the skb. This leads to a flow control accounting
error causing permanent congestion.

In this commit, we fix this accounting error by always using the
size of the incoming message.

Fixes: 10724cc7bb ("tipc: redesign connection-level flow control")
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 11:45:38 -04:00
Parthasarathy Bhuvaragan 3364d61c92 tipc: fix socket flow control accounting error at tipc_send_stream
Until now in tipc_send_stream(), we return -1 when the socket
encounters link congestion even if the socket had successfully
sent partial data. This is incorrect as the application resends
the same the partial data leading to data corruption at
receiver's end.

In this commit, we return the partially sent bytes as the return
value at link congestion.

Fixes: 10724cc7bb ("tipc: redesign connection-level flow control")
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 11:45:37 -04:00
Pan Bian 78302fd405 tipc: check return value of nlmsg_new
Function nlmsg_new() will return a NULL pointer if there is no enough
memory, and its return value should be checked before it is used.
However, in function tipc_nl_node_get_monitor(), the validation of the
return value of function nlmsg_new() is missed. This patch fixes the
bug.

Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24 15:51:30 -04:00
Johannes Berg fe52145f91 netlink: pass extended ACK struct where available
This is an add-on to the previous patch that passes the extended ACK
structure where it's already available by existing genl_info or extack
function arguments.

This was done with this spatch (with some manual adjustment of
indentation):

@@
expression A, B, C, D, E;
identifier fn, info;
@@
fn(..., struct genl_info *info, ...) {
...
-nlmsg_parse(A, B, C, D, E, NULL)
+nlmsg_parse(A, B, C, D, E, info->extack)
...
}

@@
expression A, B, C, D, E;
identifier fn, info;
@@
fn(..., struct genl_info *info, ...) {
<...
-nla_parse_nested(A, B, C, D, NULL)
+nla_parse_nested(A, B, C, D, info->extack)
...>
}

@@
expression A, B, C, D, E;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nlmsg_parse(A, B, C, D, E, NULL)
+nlmsg_parse(A, B, C, D, E, extack)
...>
}

@@
expression A, B, C, D, E;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nla_parse(A, B, C, D, E, NULL)
+nla_parse(A, B, C, D, E, extack)
...>
}

@@
expression A, B, C, D, E;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
...
-nlmsg_parse(A, B, C, D, E, NULL)
+nlmsg_parse(A, B, C, D, E, extack)
...
}

@@
expression A, B, C, D;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nla_parse_nested(A, B, C, D, NULL)
+nla_parse_nested(A, B, C, D, extack)
...>
}

@@
expression A, B, C, D;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nlmsg_validate(A, B, C, D, NULL)
+nlmsg_validate(A, B, C, D, extack)
...>
}

@@
expression A, B, C, D;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nla_validate(A, B, C, D, NULL)
+nla_validate(A, B, C, D, extack)
...>
}

@@
expression A, B, C;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nla_validate_nested(A, B, C, NULL)
+nla_validate_nested(A, B, C, extack)
...>
}

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13 13:58:22 -04:00
Johannes Berg fceb6435e8 netlink: pass extended ACK struct to parsing functions
Pass the new extended ACK reporting struct to all of the generic
netlink parsing functions. For now, pass NULL in almost all callers
(except for some in the core.)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-13 13:58:22 -04:00
Erik Hugne 66bc1e8d5d tipc: allow rdm/dgram socketpairs
for socketpairs using connectionless transport, we cache
the respective node local TIPC portid to use in subsequent
calls to send() in the socket's private data.

Signed-off-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-29 14:10:11 -07:00
Erik Hugne 70b03759e9 tipc: add support for stream/seqpacket socketpairs
sockets A and B are connected back-to-back, similar to what
AF_UNIX does.

Signed-off-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-29 14:10:11 -07:00
Ying Xue 7efea60dcf tipc: adjust the policy of holding subscription kref
When a new subscription object is inserted into name_seq->subscriptions
list, it's under name_seq->lock protection; when a subscription is
deleted from the list, it's also under the same lock protection;
similarly, when accessing a subscription by going through subscriptions
list, the entire process is also protected by the name_seq->lock.

Therefore, if subscription refcount is increased before it's inserted
into subscriptions list, and its refcount is decreased after it's
deleted from the list, it will be unnecessary to hold refcount at all
before accessing subscription object which is obtained by going through
subscriptions list under name_seq->lock protection.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28 18:03:33 -07:00
Ying Xue 139bb36f75 tipc: advance the time of deleting subscription from subscriber->subscrp_list
After a subscription object is created, it's inserted into its
subscriber subscrp_list list under subscriber lock protection,
similarly, before it's destroyed, it should be first removed from
its subscriber->subscrp_list. Since the subscription list is
accessed with subscriber lock, all the subscriptions are valid
during the lock duration. Hence in tipc_subscrb_subscrp_delete(), we
remove subscription get/put and the extra subscriber unlock/lock.

After this change, the subscriptions refcount cleanup is very simple
and does not access any lock.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-28 18:03:32 -07:00
Ying Xue 557d054c01 tipc: fix nametbl deadlock at tipc_nametbl_unsubscribe
Until now, tipc_nametbl_unsubscribe() is called at subscriptions
reference count cleanup. Usually the subscriptions cleanup is
called at subscription timeout or at subscription cancel or at
subscriber delete.

We have ignored the possibility of this being called from other
locations, which causes deadlock as we try to grab the
tn->nametbl_lock while holding it already.

   CPU1:                             CPU2:
----------                     ----------------
tipc_nametbl_publish
spin_lock_bh(&tn->nametbl_lock)
tipc_nametbl_insert_publ
tipc_nameseq_insert_publ
tipc_subscrp_report_overlap
tipc_subscrp_get
tipc_subscrp_send_event
                             tipc_close_conn
                             tipc_subscrb_release_cb
                             tipc_subscrb_delete
                             tipc_subscrp_put
tipc_subscrp_put
tipc_subscrp_kref_release
tipc_nametbl_unsubscribe
spin_lock_bh(&tn->nametbl_lock)
<<grab nametbl_lock again>>

   CPU1:                              CPU2:
----------                     ----------------
tipc_nametbl_stop
spin_lock_bh(&tn->nametbl_lock)
tipc_purge_publications
tipc_nameseq_remove_publ
tipc_subscrp_report_overlap
tipc_subscrp_get
tipc_subscrp_send_event
                             tipc_close_conn
                             tipc_subscrb_release_cb
                             tipc_subscrb_delete
                             tipc_subscrp_put
tipc_subscrp_put
tipc_subscrp_kref_release
tipc_nametbl_unsubscribe
spin_lock_bh(&tn->nametbl_lock)
<<grab nametbl_lock again>>

In this commit, we advance the calling of tipc_nametbl_unsubscribe()
from the refcount cleanup to the intended callers.

Fixes: d094c4d5f5 ("tipc: add subscription refcount to avoid invalid delete")
Reported-by: John Thompson <thompa.atl@gmail.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 11:59:16 -07:00
David Howells cdfbabfb2f net: Work around lockdep limitation in sockets that use sockets
Lockdep issues a circular dependency warning when AFS issues an operation
through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.

The theory lockdep comes up with is as follows:

 (1) If the pagefault handler decides it needs to read pages from AFS, it
     calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
     creating a call requires the socket lock:

	mmap_sem must be taken before sk_lock-AF_RXRPC

 (2) afs_open_socket() opens an AF_RXRPC socket and binds it.  rxrpc_bind()
     binds the underlying UDP socket whilst holding its socket lock.
     inet_bind() takes its own socket lock:

	sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET

 (3) Reading from a TCP socket into a userspace buffer might cause a fault
     and thus cause the kernel to take the mmap_sem, but the TCP socket is
     locked whilst doing this:

	sk_lock-AF_INET must be taken before mmap_sem

However, lockdep's theory is wrong in this instance because it deals only
with lock classes and not individual locks.  The AF_INET lock in (2) isn't
really equivalent to the AF_INET lock in (3) as the former deals with a
socket entirely internal to the kernel that never sees userspace.  This is
a limitation in the design of lockdep.

Fix the general case by:

 (1) Double up all the locking keys used in sockets so that one set are
     used if the socket is created by userspace and the other set is used
     if the socket is created by the kernel.

 (2) Store the kern parameter passed to sk_alloc() in a variable in the
     sock struct (sk_kern_sock).  This informs sock_lock_init(),
     sock_init_data() and sk_clone_lock() as to the lock keys to be used.

     Note that the child created by sk_clone_lock() inherits the parent's
     kern setting.

 (3) Add a 'kern' parameter to ->accept() that is analogous to the one
     passed in to ->create() that distinguishes whether kernel_accept() or
     sys_accept4() was the caller and can be passed to sk_alloc().

     Note that a lot of accept functions merely dequeue an already
     allocated socket.  I haven't touched these as the new socket already
     exists before we get the parameter.

     Note also that there are a couple of places where I've made the accepted
     socket unconditionally kernel-based:

	irda_accept()
	rds_rcp_accept_one()
	tcp_accept_from_sock()

     because they follow a sock_create_kern() and accept off of that.

Whilst creating this, I noticed that lustre and ocfs don't create sockets
through sock_create_kern() and thus they aren't marked as for-kernel,
though they appear to be internal.  I wonder if these should do that so
that they use the new set of lock keys.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-09 18:23:27 -08:00
Ingo Molnar 174cd4b1e5 sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>
Fix up affected files that include this signal functionality via sched.h.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:32 +01:00
Jon Paul Maloy 681a55d717 tipc: move premature initilalization of stack variables
In the function tipc_rcv() we initialize a couple of stack variables
from the message header before that same header has been validated.
In rare cases when the arriving header is non-linar, the validation
function itself may linearize the buffer by calling skb_may_pull(),
while the wrongly initialized stack fields are not updated accordingly.

We fix this in this commit.

Reported-by: Matthew Wong <mwong@sonusnet.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-24 11:42:54 -05:00
Herbert Xu 40f9f43970 tipc: Fix tipc_sk_reinit race conditions
There are two problems with the function tipc_sk_reinit.  Firstly
it's doing a manual walk over an rhashtable.  This is broken as
an rhashtable can be resized and if you manually walk over it
during a resize then you may miss entries.

Secondly it's missing memory barriers as previously the code used
spinlocks which provide the barriers implicitly.

This patch fixes both problems.

Fixes: 07f6c4bc04 ("tipc: convert tipc reference table to...")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 12:28:35 -05:00
David S. Miller 4e8f2fc1a5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two trivial overlapping changes conflicts in MPLS and mlx5.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-28 10:33:06 -05:00
Dan Carpenter a08ef4768f tipc: uninitialized return code in tipc_setsockopt()
We shuffled some code around and added some new case statements here and
now "res" isn't initialized on all paths.

Fixes: 01fd12bb18 ("tipc: make replicast a user selectable option")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-25 12:41:34 -05:00
Parthasarathy Bhuvaragan 35e22e49a5 tipc: fix cleanup at module unload
In tipc_server_stop(), we iterate over the connections with limiting
factor as server's idr_in_use. We ignore the fact that this variable
is decremented in tipc_close_conn(), leading to premature exit.

In this commit, we iterate until the we have no connections left.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: John Thompson <thompa.atl@gmail.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 16:14:58 -05:00
Parthasarathy Bhuvaragan 4c887aa65d tipc: ignore requests when the connection state is not CONNECTED
In tipc_conn_sendmsg(), we first queue the request to the outqueue
followed by the connection state check. If the connection is not
connected, we should not queue this message.

In this commit, we reject the messages if the connection state is
not CF_CONNECTED.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Tested-by: John Thompson <thompa.atl@gmail.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 16:14:58 -05:00
Parthasarathy Bhuvaragan 9dc3abdd1f tipc: fix nametbl_lock soft lockup at module exit
Commit 333f796235 ("tipc: fix a race condition leading to
subscriber refcnt bug") reveals a soft lockup while acquiring
nametbl_lock.

Before commit 333f796235, we call tipc_conn_shutdown() from
tipc_close_conn() in the context of tipc_topsrv_stop(). In that
context, we are allowed to grab the nametbl_lock.

Commit 333f796235, moved tipc_conn_release (renamed from
tipc_conn_shutdown) to the connection refcount cleanup. This allows
either tipc_nametbl_withdraw() or tipc_topsrv_stop() to the cleanup.

Since tipc_exit_net() first calls tipc_topsrv_stop() and then
tipc_nametble_withdraw() increases the chances for the later to
perform the connection cleanup.

The soft lockup occurs in the call chain of tipc_nametbl_withdraw(),
when it performs the tipc_conn_kref_release() as it tries to grab
nametbl_lock again while holding it already.
tipc_nametbl_withdraw() grabs nametbl_lock
  tipc_nametbl_remove_publ()
    tipc_subscrp_report_overlap()
      tipc_subscrp_send_event()
        tipc_conn_sendmsg()
          << if (con->flags != CF_CONNECTED) we do conn_put(),
             triggering the cleanup as refcount=0. >>
          tipc_conn_kref_release
            tipc_sock_release
              tipc_conn_release
                tipc_subscrb_delete
                  tipc_subscrp_delete
                    tipc_nametbl_unsubscribe << Soft Lockup >>

The previous changes in this series fixes the race conditions fixed
by commit 333f796235. Hence we can now revert the commit.

Fixes: 333f796235 ("tipc: fix a race condition leading to subscriber refcnt bug")
Reported-and-Tested-by: John Thompson <thompa.atl@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 16:14:58 -05:00
Parthasarathy Bhuvaragan fc0adfc8fd tipc: fix connection refcount error
Until now, the generic server framework maintains the connection
id's per subscriber in server's conn_idr. At tipc_close_conn, we
remove the connection id from the server list, but the connection is
valid until we call the refcount cleanup. Hence we have a window
where the server allocates the same connection to an new subscriber
leading to inconsistent reference count. We have another refcount
warning we grab the refcount in tipc_conn_lookup() for connections
with flag with CF_CONNECTED not set. This usually occurs at shutdown
when the we stop the topology server and withdraw TIPC_CFG_SRV
publication thereby triggering a withdraw message to subscribers.

In this commit, we:
1. remove the connection from the server list at recount cleanup.
2. grab the refcount for a connection only if CF_CONNECTED is set.

Tested-by: John Thompson <thompa.atl@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 16:14:57 -05:00
Parthasarathy Bhuvaragan d094c4d5f5 tipc: add subscription refcount to avoid invalid delete
Until now, the subscribers keep track of the subscriptions using
reference count at subscriber level. At subscription cancel or
subscriber delete, we delete the subscription only if the timer
was pending for the subscription. This approach is incorrect as:
1. del_timer() is not SMP safe, if on CPU0 the check for pending
   timer returns true but CPU1 might schedule the timer callback
   thereby deleting the subscription. Thus when CPU0 is scheduled,
   it deletes an invalid subscription.
2. We export tipc_subscrp_report_overlap(), which accesses the
   subscription pointer multiple times. Meanwhile the subscription
   timer can expire thereby freeing the subscription and we might
   continue to access the subscription pointer leading to memory
   violations.

In this commit, we introduce subscription refcount to avoid deleting
an invalid subscription.

Reported-and-Tested-by: John Thompson <thompa.atl@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 16:14:57 -05:00
Parthasarathy Bhuvaragan 93f955aad4 tipc: fix nametbl_lock soft lockup at node/link events
We trigger a soft lockup as we grab nametbl_lock twice if the node
has a pending node up/down or link up/down event while:
- we process an incoming named message in tipc_named_rcv() and
  perform an tipc_update_nametbl().
- we have pending backlog items in the name distributor queue
  during a nametable update using tipc_nametbl_publish() or
  tipc_nametbl_withdraw().

The following are the call chain associated:
tipc_named_rcv() Grabs nametbl_lock
   tipc_update_nametbl() (publish/withdraw)
     tipc_node_subscribe()/unsubscribe()
       tipc_node_write_unlock()
          << lockup occurs if an outstanding node/link event
             exits, as we grabs nametbl_lock again >>

tipc_nametbl_withdraw() Grab nametbl_lock
  tipc_named_process_backlog()
    tipc_update_nametbl()
      << rest as above >>

The function tipc_node_write_unlock(), in addition to releasing the
lock processes the outstanding node/link up/down events. To do this,
we need to grab the nametbl_lock again leading to the lockup.

In this commit we fix the soft lockup by introducing a fast variant of
node_unlock(), where we just release the lock. We adapt the
node_subscribe()/node_unsubscribe() to use the fast variants.

Reported-and-Tested-by: John Thompson <thompa.atl@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-24 16:14:57 -05:00
Jon Paul Maloy 01fd12bb18 tipc: make replicast a user selectable option
If the bearer carrying multicast messages supports broadcast, those
messages will be sent to all cluster nodes, irrespective of whether
these nodes host any actual destinations socket or not. This is clearly
wasteful if the cluster is large and there are only a few real
destinations for the message being sent.

In this commit we extend the eligibility of the newly introduced
"replicast" transmit option. We now make it possible for a user to
select which method he wants to be used, either as a mandatory setting
via setsockopt(), or as a relative setting where we let the broadcast
layer decide which method to use based on the ratio between cluster
size and the message's actual number of destination nodes.

In the latter case, a sending socket must stick to a previously
selected method until it enters an idle period of at least 5 seconds.
This eliminates the risk of message reordering caused by method change,
i.e., when changes to cluster size or number of destinations would
otherwise mandate a new method to be used.

Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 12:10:17 -05:00
Jon Paul Maloy a853e4c6d0 tipc: introduce replicast as transport option for multicast
TIPC multicast messages are currently carried over a reliable
'broadcast link', making use of the underlying media's ability to
transport packets as L2 broadcast or IP multicast to all nodes in
the cluster.

When the used bearer is lacking that ability, we can instead emulate
the broadcast service by replicating and sending the packets over as
many unicast links as needed to reach all identified destinations.
We now introduce a new TIPC link-level 'replicast' service that does
this.

Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 12:10:17 -05:00
Jon Paul Maloy 2ae0b8af1f tipc: add functionality to lookup multicast destination nodes
As a further preparation for the upcoming 'replicast' functionality,
we add some necessary structs and functions for looking up and returning
a list of all nodes that host destinations for a given multicast message.

Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 12:10:16 -05:00
Jon Paul Maloy 9999974a83 tipc: add function for checking broadcast support in bearer
As a preparation for the 'replicast' functionality we are going to
introduce in the next commits, we need the broadcast base structure to
store whether bearer broadcast is available at all from the currently
used bearer or bearers.

We do this by adding a new function tipc_bearer_bcast_support() to
the bearer layer, and letting the bearer selection function in
bcast.c use this to give a new boolean field, 'bcast_support' the
appropriate value.

Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-20 12:10:15 -05:00
David S. Miller 580bdf5650 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-01-17 15:19:37 -05:00
Parthasarathy Bhuvaragan 57d5f64d83 tipc: allocate user memory with GFP_KERNEL flag
Until now, we allocate memory always with GFP_ATOMIC flag.
When the system is under memory pressure and a user tries to send,
the send fails due to low memory. However, the user application
can wait for free memory if we allocate it using GFP_KERNEL flag.

In this commit, we use allocate memory with GFP_KERNEL for all user
allocation.

Reported-by: Rune Torgersen <runet@innovsys.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 13:31:53 -05:00
Jon Paul Maloy 365ad353c2 tipc: reduce risk of user starvation during link congestion
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning to the user with -EAGAIN and let him re-try later.

This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.

In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is attempted sent via a
congested link, we now let it be added to the link's backlog queue
anyway, thus permitting an oversubscription of one message per source
socket. We still create a wakeup item and return an error code, hence
instructing the sender to block or stop sending. Only when enough space
has been freed up in the link's backlog queue do we issue a wakeup event
that allows the sender to continue with the next message, if any.

The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 11:13:05 -05:00
Jon Paul Maloy 4d8642d896 tipc: modify struct tipc_plist to be more versatile
During multicast reception we currently use a simple linked list with
push/pop semantics to store port numbers.

We now see a need for a more generic list for storing values of type
u32. We therefore make some modifications to this list, while replacing
the prefix 'tipc_plist_' with 'u32_'. We also add a couple of new
functions which will come to use in the next commits.

Acked-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 11:13:05 -05:00
Jon Paul Maloy 8c44e1af16 tipc: unify tipc_wait_for_sndpkt() and tipc_wait_for_sndmsg() functions
The functions tipc_wait_for_sndpkt() and tipc_wait_for_sndmsg() are very
similar. The latter function is also called from two locations, and
there will be more in the coming commits, which will all need to test on
different conditions.

Instead of making yet another duplicates of the function, we now
introduce a new macro tipc_wait_for_cond() where the wakeup condition
can be stated as an argument to the call. This macro replaces all
current and future uses of the two functions, which can now be
eliminated.

Acked-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-03 11:13:05 -05:00
Jon Paul Maloy 693c56491f tipc: don't send FIN message from connectionless socket
In commit 6f00089c73 ("tipc: remove SS_DISCONNECTING state") the
check for socket type is in the wrong place, causing a closing socket
to always send out a FIN message even when the socket was never
connected. This is normally harmless, since the destination node for
such messages most often is zero, and the message will be dropped, but
it is still a wrong and confusing behavior.

We fix this in this commit.

Reviewed-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-23 17:53:47 -05:00
Linus Torvalds 9a19a6db37 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:

 - more ->d_init() stuff (work.dcache)

 - pathname resolution cleanups (work.namei)

 - a few missing iov_iter primitives - copy_from_iter_full() and
   friends. Either copy the full requested amount, advance the iterator
   and return true, or fail, return false and do _not_ advance the
   iterator. Quite a few open-coded callers converted (and became more
   readable and harder to fuck up that way) (work.iov_iter)

 - several assorted patches, the big one being logfs removal

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  logfs: remove from tree
  vfs: fix put_compat_statfs64() does not handle errors
  namei: fold should_follow_link() with the step into not-followed link
  namei: pass both WALK_GET and WALK_MORE to should_follow_link()
  namei: invert WALK_PUT logics
  namei: shift interpretation of LOOKUP_FOLLOW inside should_follow_link()
  namei: saner calling conventions for mountpoint_last()
  namei.c: get rid of user_path_parent()
  switch getfrag callbacks to ..._full() primitives
  make skb_add_data,{_nocache}() and skb_copy_to_page_nocache() advance only on success
  [iov_iter] new primitives - copy_from_iter_full() and friends
  don't open-code file_inode()
  ceph: switch to use of ->d_init()
  ceph: unify dentry_operations instances
  lustre: switch to use of ->d_init()
2016-12-16 10:24:44 -08:00
Al Viro cbbd26b8b1 [iov_iter] new primitives - copy_from_iter_full() and friends
copy_from_iter_full(), copy_from_iter_full_nocache() and
csum_and_copy_from_iter_full() - counterparts of copy_from_iter()
et.al., advancing iterator only in case of successful full copy
and returning whether it had been successful or not.

Convert some obvious users.  *NOTE* - do not blindly assume that
something is a good candidate for those unless you are sure that
not advancing iov_iter in failure case is the right thing in
this case.  Anything that does short read/short write kind of
stuff (or is in a loop, etc.) is unlikely to be a good one.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-05 14:33:36 -05:00
David S. Miller 2745529ac7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Couple conflicts resolved here:

1) In the MACB driver, a bug fix to properly initialize the
   RX tail pointer properly overlapped with some changes
   to support variable sized rings.

2) In XGBE we had a "CONFIG_PM" --> "CONFIG_PM_SLEEP" fix
   overlapping with a reorganization of the driver to support
   ACPI, OF, as well as PCI variants of the chip.

3) In 'net' we had several probe error path bug fixes to the
   stmmac driver, meanwhile a lot of this code was cleaned up
   and reorganized in 'net-next'.

4) The cls_flower classifier obtained a helper function in
   'net-next' called __fl_delete() and this overlapped with
   Daniel Borkamann's bug fix to use RCU for object destruction
   in 'net'.  It also overlapped with Jiri's change to guard
   the rhashtable_remove_fast() call with a check against
   tc_skip_sw().

5) In mlx4, a revert bug fix in 'net' overlapped with some
   unrelated changes in 'net-next'.

6) In geneve, a stale header pointer after pskb_expand_head()
   bug fix in 'net' overlapped with a large reorganization of
   the same code in 'net-next'.  Since the 'net-next' code no
   longer had the bug in question, there was nothing to do
   other than to simply take the 'net-next' hunks.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-03 12:29:53 -05:00
Michal Kubeček 3de81b7588 tipc: check minimum bearer MTU
Qian Zhang (张谦) reported a potential socket buffer overflow in
tipc_msg_build() which is also known as CVE-2016-8632: due to
insufficient checks, a buffer overflow can occur if MTU is too short for
even tipc headers. As anyone can set device MTU in a user/net namespace,
this issue can be abused by a regular user.

As agreed in the discussion on Ben Hutchings' original patch, we should
check the MTU at the moment a bearer is attached rather than for each
processed packet. We also need to repeat the check when bearer MTU is
adjusted to new device MTU. UDP case also needs a check to avoid
overflow when calculating bearer MTU.

Fixes: b97bf3fd8f ("[TIPC] Initial merge")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reported-by: Qian Zhang (张谦) <zhangqian-c@360.cn>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-02 14:03:20 -05:00
Jon Paul Maloy 9590112241 tipc: fix link statistics counter errors
In commit e4bf4f7696 ("tipc: simplify packet sequence number
handling") we changed the internal representation of the packet
sequence number counters from u32 to u16, reflecting what is really
sent over the wire.

Since then some link statistics counters have been displaying incorrect
values, partially because the counters meant to be used as sequence
number snapshots are now used as direct counters, stored as u32, and
partially because some counter updates are just missing in the code.

In this commit we correct this in two ways. First, we base the
displayed packet sent/received values on direct counters instead
of as previously a calculated difference between current sequence
number and a snapshot. Second, we add the missing updates of the
counters.

This change is compatible with the current netlink API, and requires
no changes to the user space tools.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-27 20:35:55 -05:00
David S. Miller 0b42f25d2f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
udplite conflict is resolved by taking what 'net-next' did
which removed the backlog receive method assignment, since
it is no longer necessary.

Two entries were added to the non-priv ethtool operations
switch statement, one in 'net' and one in 'net-next, so
simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-26 23:42:21 -05:00
Jon Paul Maloy 6998cc6ec2 tipc: resolve connection flow control compatibility problem
In commit 10724cc7bb ("tipc: redesign connection-level flow control")
we replaced the previous message based flow control with one based on
1k blocks. In order to ensure backwards compatibility the mechanism
falls back to using message as base unit when it senses that the peer
doesn't support the new algorithm. The default flow control window,
i.e., how many units can be sent before the sender blocks and waits
for an acknowledge (aka advertisement) is 512. This was tested against
the previous version, which uses an acknowledge frequency of on ack per
256 received message, and found to work fine.

However, we missed the fact that versions older than Linux 3.15 use an
acknowledge frequency of 512, which is exactly the limit where a 4.6+
sender will stop and wait for acknowledge. This would also work fine if
it weren't for the fact that if the first sent message on a 4.6+ server
side is an empty SYNACK, this one is also is counted as a sent message,
while it is not counted as a received message on a legacy 3.15-receiver.
This leads to the sender always being one step ahead of the receiver, a
scenario causing the sender to block after 512 sent messages, while the
receiver only has registered 511 read messages. Hence, the legacy
receiver is not trigged to send an acknowledge, with a permanently
blocked sender as result.

We solve this deadlock by simply allowing the sender to send one more
message before it blocks, i.e., by a making minimal change to the
condition used for determining connection congestion.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25 21:38:16 -05:00
Jon Paul Maloy d876a4d2af tipc: improve sanity check for received domain records
In commit 35c55c9877 ("tipc: add neighbor monitoring framework") we
added a data area to the link monitor STATE messages under the
assumption that previous versions did not use any such data area.

For versions older than Linux 4.3 this assumption is not correct. In
those version, all STATE messages sent out from a node inadvertently
contain a 16 byte data area containing a string; -a leftover from
previous RESET messages which were using this during the setup phase.
This string serves no purpose in STATE messages, and should no be there.

Unfortunately, this data area is delivered to the link monitor
framework, where a sanity check catches that it is not a correct domain
record, and drops it. It also issues a rate limited warning about the
event.

Since such events occur much more frequently than anticipated, we now
choose to remove the warning in order to not fill the kernel log with
useless contents. We also make the sanity check stricter, to further
reduce the risk that such data is inavertently admitted.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25 20:06:18 -05:00
Jon Paul Maloy f79675563a tipc: fix compatibility bug in link monitoring
commit 817298102b ("tipc: fix link priority propagation") introduced a
compatibility problem between TIPC versions newer than Linux 4.6 and
those older than Linux 4.4. In versions later than 4.4, link STATE
messages only contain a non-zero link priority value when the sender
wants the receiver to change its priority. This has the effect that the
receiver resets itself in order to apply the new priority. This works
well, and is consistent with the said commit.

However, in versions older than 4.4 a valid link priority is present in
all sent link STATE messages, leading to cyclic link establishment and
reset on the 4.6+ node.

We fix this by adding a test that the received value should not only
be valid, but also differ from the current value in order to cause the
receiving link endpoint to reset.

Reported-by: Amar Nv <amar.nv005@gmail.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-25 20:06:18 -05:00
David S. Miller f9aa9dc7d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All conflicts were simple overlapping changes except perhaps
for the Thunder driver.

That driver has a change_mtu method explicitly for sending
a message to the hardware.  If that fails it returns an
error.

Normally a driver doesn't need an ndo_change_mtu method becuase those
are usually just range changes, which are now handled generically.
But since this extra operation is needed in the Thunder driver, it has
to stay.

However, if the message send fails we have to restore the original
MTU before the change because the entire call chain expects that if
an error is thrown by ndo_change_mtu then the MTU did not change.
Therefore code is added to nicvf_change_mtu to remember the original
MTU, and to restore it upon nicvf_update_hw_max_frs() failue.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-22 13:27:16 -05:00
Jon Paul Maloy 51b9a31c42 tipc: eliminate obsolete socket locking policy description
The comment block in socket.c describing the locking policy is
obsolete, and does not reflect current reality. We remove it in this
commit.

Since the current locking policy is much simpler and follows a
mainstream approach, we see no need to add a new description.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-19 22:15:41 -05:00
Alexey Dobriyan c7d03a00b5 netns: make struct pernet_operations::id unsigned int
Make struct pernet_operations::id unsigned.

There are 2 reasons to do so:

1)
This field is really an index into an zero based array and
thus is unsigned entity. Using negative value is out-of-bound
access by definition.

2)
On x86_64 unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added or subtracted to pointers
are preffered to signed 32-bit data.

"int" being used as an array index needs to be sign-extended
to 64-bit before being used.

	void f(long *p, int i)
	{
		g(p[i]);
	}

  roughly translates to

	movsx	rsi, esi
	mov	rdi, [rsi+...]
	call 	g

MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.

Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:

	static inline void *net_generic(const struct net *net, int id)
	{
		...
		ptr = ng->ptr[id - 1];
		...
	}

And this function is used a lot, so those sign extensions add up.

Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):

	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)

Unfortunately some functions actually grow bigger.
This is a semmingly random artefact of code generation with register
allocator being used differently. gcc decides that some variable
needs to live in new r8+ registers and every access now requires REX
prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
used which is longer than [r8]

However, overall balance is in negative direction:

	add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
	function                                     old     new   delta
	nfsd4_lock                                  3886    3959     +73
	tipc_link_build_proto_msg                   1096    1140     +44
	mac80211_hwsim_new_radio                    2776    2808     +32
	tipc_mon_rcv                                1032    1058     +26
	svcauth_gss_legacy_init                     1413    1429     +16
	tipc_bcbase_select_primary                   379     392     +13
	nfsd4_exchange_id                           1247    1260     +13
	nfsd4_setclientid_confirm                    782     793     +11
		...
	put_client_renew_locked                      494     480     -14
	ip_set_sockfn_get                            730     716     -14
	geneve_sock_add                              829     813     -16
	nfsd4_sequence_done                          721     703     -18
	nlmclnt_lookup_host                          708     686     -22
	nfsd4_lockt                                 1085    1063     -22
	nfs_get_client                              1077    1050     -27
	tcf_bpf_init                                1106    1076     -30
	nfsd4_encode_fattr                          5997    5930     -67
	Total: Before=154856051, After=154854321, chg -0.00%

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-18 10:59:15 -05:00
WANG Cong d9dc8b0f8b net: fix sleeping for sk_wait_event()
Similar to commit 14135f30e3 ("inet: fix sleeping inside inet_wait_for_connect()"),
sk_wait_event() needs to fix too, because release_sock() is blocking,
it changes the process state back to running after sleep, which breaks
the previous prepare_to_wait().

Switch to the new wait API.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-14 13:17:21 -05:00
Parthasarathy Bhuvaragan f40acbaf42 tipc: remove SS_CONNECTED sock state
In this commit, we replace references to sock->state SS_CONNECTE
with sk_state TIPC_ESTABLISHED.

Finally, the sock->state is no longer explicitly used by tipc.
The FSM below is for various types of connection oriented sockets.

Stream Server Listening Socket:
+-----------+       +-------------+
| TIPC_OPEN |------>| TIPC_LISTEN |
+-----------+       +-------------+

Stream Server Data Socket:
+-----------+       +------------------+
| TIPC_OPEN |------>| TIPC_ESTABLISHED |
+-----------+       +------------------+
                          ^   |
                          |   |
                          |   v
                    +--------------------+
                    | TIPC_DISCONNECTING |
                    +--------------------+

Stream Socket Client:
+-----------+       +-----------------+
| TIPC_OPEN |------>| TIPC_CONNECTING |------+
+-----------+       +-----------------+      |
                            |                |
                            |                |
                            v                |
                    +------------------+     |
                    | TIPC_ESTABLISHED |     |
                    +------------------+     |
                          ^   |              |
                          |   |              |
                          |   v              |
                    +--------------------+   |
                    | TIPC_DISCONNECTING |<--+
                    +--------------------+

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:26 -04:00
Parthasarathy Bhuvaragan 99a2088981 tipc: create TIPC_CONNECTING as a new sk_state
In this commit, we create a new tipc socket state TIPC_CONNECTING
by primarily replacing the SS_CONNECTING with TIPC_CONNECTING.

There is no functional change in this commit.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:26 -04:00
Parthasarathy Bhuvaragan 6f00089c73 tipc: remove SS_DISCONNECTING state
In this commit, we replace the references to SS_DISCONNECTING with
the combination of sk_state TIPC_DISCONNECTING and flags set in
sk_shutdown.
We introduce a new function _tipc_shutdown(), which provides
the common code required by tipc_release() and tipc_shutdown().

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:25 -04:00
Parthasarathy Bhuvaragan 9fd4b070f6 tipc: create TIPC_DISCONNECTING as a new sk_state
In this commit, we create a new tipc socket state TIPC_DISCONNECTING in
sk_state. TIPC_DISCONNECTING is replacing the socket connection status
update using SS_DISCONNECTING.
TIPC_DISCONNECTING is set for connection oriented sockets at:
- tipc_shutdown()
- connection probe timeout
- when we receive an error message on the connection.

There is no functional change in this commit.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:25 -04:00
Parthasarathy Bhuvaragan 438adcaf0d tipc: create TIPC_OPEN as a new sk_state
In this commit, we create a new tipc socket state TIPC_OPEN in
sk_state. We primarily replace the SS_UNCONNECTED sock->state with
TIPC_OPEN.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:25 -04:00
Parthasarathy Bhuvaragan 8ea642ee9a tipc: create TIPC_ESTABLISHED as a new sk_state
Until now, tipc maintains probing state for connected sockets in
tsk->probing_state variable.

In this commit, we express this information as socket states and
this remove the variable. We set probe_unacked flag when a probe
is sent out and reset it if we receive a reply. Instead of the
probing state TIPC_CONN_OK, we create a new state TIPC_ESTABLISHED.

There is no functional change in this commit.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:25 -04:00
Parthasarathy Bhuvaragan 0c288c8692 tipc: create TIPC_LISTEN as a new sk_state
Until now, tipc maintains the socket state in sock->state variable.
This is used to maintain generic socket states, but in tipc
we overload it and save tipc socket states like TIPC_LISTEN.
Other protocols like TCP, UDP store protocol specific states
in sk->sk_state instead.

In this commit, we :
- declare a new tipc state TIPC_LISTEN, that replaces SS_LISTEN
- Create a new function tipc_set_state(), to update sk->sk_state.
- TIPC_LISTEN state is maintained in sk->sk_state.
- replace references to SS_LISTEN with TIPC_LISTEN.

There is no functional change in this commit.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:25 -04:00
Parthasarathy Bhuvaragan c752023aab tipc: remove socket state SS_READY
Until now, tipc socket state SS_READY declares that the socket is a
connectionless socket.

In this commit, we remove the state SS_READY and replace it with a
condition which returns true for datagram / connectionless sockets.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:24 -04:00
Parthasarathy Bhuvaragan 360aab6b49 tipc: remove probing_intv from tipc_sock
Until now, probing_intv is a variable in struct tipc_sock but is
always set to a constant CONN_PROBING_INTERVAL. The socket
connection is probed based on this value.

In this commit, we remove this variable and setup the socket
timer based on the constant CONN_PROBING_INTERVAL.

There is no functional change in this commit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:24 -04:00
Parthasarathy Bhuvaragan d6fb7e9c99 tipc: remove tsk->connected from tipc_sock
Until now, we determine if a socket is connected or not based on
tsk->connected, which is set once when the probing state is set
to TIPC_CONN_OK. It is unset when the sock->state is updated from
SS_CONNECTED to any other state.

In this commit, we remove connected variable from tipc_sock and
derive socket connection status from the following condition:
sock->state == SS_CONNECTED => tsk->connected

There is no functional change in this commit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:24 -04:00
Parthasarathy Bhuvaragan 87227fe7e4 tipc: remove tsk->connected for connectionless sockets
Until now, for connectionless sockets the peer information during
connect is stored in tsk->peer and a connection state is set in
tsk->connected. This is redundant.

In this commit, for connectionless sockets we update:
- __tipc_sendmsg(), when the destination is NULL the peer existence
  is determined by tsk->peer.family, instead of tsk->connected.
- tipc_connect(), remove set/unset of tsk->connected.
Hence tsk->connected is no longer used for connectionless sockets.

There is no functional change in this commit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:24 -04:00
Parthasarathy Bhuvaragan aeda16b6ae tipc: rename tsk->remote to tsk->peer for consistent naming
Until now, the peer information for connect is stored in tsk->remote
but the rest of code uses the name peer for peer/remote.

In this commit, we rename tsk->remote to tsk->peer to align with
naming convention followed in the rest of the code.

There is no functional change in this commit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:24 -04:00
Parthasarathy Bhuvaragan ba8aebe943 tipc: rename struct tipc_skb_cb member handle to bytes_read
In this commit, we rename handle to bytes_read indicating the
purpose of the member.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:23 -04:00
Parthasarathy Bhuvaragan cb5da847af tipc: set kern=0 in sk_alloc() during tipc_accept()
Until now, tipc_accept() calls sk_alloc() with kern=1. This is
incorrect as the data socket's owner is the user application.
Thus for these accepted data sockets the network namespace
refcount is skipped.

In this commit, we fix this by setting kern=0.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:23 -04:00
Parthasarathy Bhuvaragan 4891d8fe16 tipc: wakeup sleeping users at disconnect
Until now, in filter_connect() when we terminate a connection due to
an error message from peer, we set the socket state to DISCONNECTING.

The socket is notified about this broken connection using EPIPE when
a user tries to send a message. However if a socket was waiting on a
poll() while the connection is being terminated, we fail to wakeup
that socket.

In this commit, we wakeup sleeping sockets at connection termination.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:23 -04:00
Parthasarathy Bhuvaragan 7cf87fa278 tipc: return early for non-blocking sockets at link congestion
Until now, in stream/mcast send() we pass the message to the link
layer even when the link is congested and add the socket to the
link's wakeup queue. This is unnecessary for non-blocking sockets.
If a socket is set to non-blocking and sends multicast with zero
back off time while receiving EAGAIN, we exhaust the memory.

In this commit, we return immediately at stream/mcast send() for
non-blocking sockets.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-01 11:53:23 -04:00
David S. Miller 27058af401 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Mostly simple overlapping changes.

For example, David Ahern's adjacency list revamp in 'net-next'
conflicted with an adjacency list traversal bug fix in 'net'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-30 12:42:58 -04:00
Jon Paul Maloy 06bd2b1ed0 tipc: fix broadcast link synchronization problem
In commit 2d18ac4ba7 ("tipc: extend broadcast link initialization
criteria") we tried to fix a problem with the initial synchronization
of broadcast link acknowledge values. Unfortunately that solution is
not sufficient to solve the issue.

We have seen it happen that LINK_PROTOCOL/STATE packets with a valid
non-zero unicast acknowledge number may bypass BCAST_PROTOCOL
initialization, NAME_DISTRIBUTOR and other STATE packets with invalid
broadcast acknowledge numbers, leading to premature opening of the
broadcast link. When the bypassed packets finally arrive, they are
inadvertently accepted, and the already correctly initialized
acknowledge number in the broadcast receive link is overwritten by
the invalid (zero) value of the said packets. After this the broadcast
link goes stale.

We now fix this by marking the packets where we know the acknowledge
value is or may be invalid, and then ignoring the acks from those.

To this purpose, we claim an unused bit in the header to indicate that
the value is invalid. We set the bit to 1 in the initial BCAST_PROTOCOL
synchronization packet and all initial ("bulk") NAME_DISTRIBUTOR
packets, plus those LINK_PROTOCOL packets sent out before the broadcast
links are fully synchronized.

This minor protocol update is fully backwards compatible.

Reported-by: John Thompson <thompa.atl@gmail.com>
Tested-by: John Thompson <thompa.atl@gmail.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-29 17:21:09 -04:00
Johannes Berg 56989f6d85 genetlink: mark families as __ro_after_init
Now genl_register_family() is the only thing (other than the
users themselves, perhaps, but I didn't find any doing that)
writing to the family struct.

In all families that I found, genl_register_family() is only
called from __init functions (some indirectly, in which case
I've add __init annotations to clarifly things), so all can
actually be marked __ro_after_init.

This protects the data structure from accidental corruption.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg 489111e5c2 genetlink: statically initialize families
Instead of providing macros/inline functions to initialize
the families, make all users initialize them statically and
get rid of the macros.

This reduces the kernel code size by about 1.6k on x86-64
(with allyesconfig).

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg a07ea4d994 genetlink: no longer support using static family IDs
Static family IDs have never really been used, the only
use case was the workaround I introduced for those users
that assumed their family ID was also their multicast
group ID.

Additionally, because static family IDs would never be
reserved by the generic netlink code, using a relatively
low ID would only work for built-in families that can be
registered immediately after generic netlink is started,
which is basically only the control family (apart from
the workaround code, which I also had to add code for so
it would reserve those IDs)

Thus, anything other than GENL_ID_GENERATE is flawed and
luckily not used except in the cases I mentioned. Move
those workarounds into a few lines of code, and then get
rid of GENL_ID_GENERATE entirely, making it more robust.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:09 -04:00
Johannes Berg c90c39dab3 genetlink: introduce and use genl_family_attrbuf()
This helper function allows family implementations to access
their family's attrbuf. This gets rid of the attrbuf usage
in families, and also adds locking validation, since it's not
valid to use the attrbuf with parallel_ops or outside of the
dumpit callback.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-27 16:16:08 -04:00
Dan Carpenter 7307616245 tipc: info leak in __tipc_nl_add_udp_addr()
We should clear out the padding and unused struct members so that we
don't expose stack information to userspace.

Fixes: fdb3accc2c ('tipc: add the ability to get UDP options via netlink')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-13 12:10:01 -04:00
Wei Yongjun c20cb81193 tipc: fix possible memory leak in tipc_udp_enable()
'ub' is malloced in tipc_udp_enable() and should be freed before
leaving from the error handling cases, otherwise it will cause
memory leak.

Fixes: ba5aa84a2d ("tipc: split UDP nl address parsing")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-13 11:28:32 -04:00
David S. Miller b20b378d49 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/mediatek/mtk_eth_soc.c
	drivers/net/ethernet/qlogic/qed/qed_dcbx.c
	drivers/net/phy/Kconfig

All conflicts were cases of overlapping commits.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-12 15:52:44 -07:00
Jon Paul Maloy e0a05ebe26 tipc: send broadcast nack directly upon sequence gap detection
Because of the risk of an excessive number of NACK messages and
retransissions, receivers have until now abstained from sending
broadcast NACKS directly upon detection of a packet sequence number
gap. We have instead relied on such gaps being detected by link
protocol STATE message exchange, something that by necessity delays
such detection and subsequent retransmissions.

With the introduction of unicast NACK transmission and rate control
of retransmissions we can now remove this limitation. We now allow
receiving nodes to send NACKS immediately, while coordinating the
permission to do so among the nodes in order to avoid NACK storms.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-02 17:10:25 -07:00
Jon Paul Maloy 7c4a54b963 tipc: rate limit broadcast retransmissions
As cluster sizes grow, so does the amount of identical or overlapping
broadcast NACKs generated by the packet receivers. This often leads to
'NACK crunches' resulting in huge numbers of redundant retransmissions
of the same packet ranges.

In this commit, we introduce rate control of broadcast retransmissions,
so that a retransmitted range cannot be retransmitted again until after
at least 10 ms. This reduces the frequency of duplicate, redundant
retransmissions by an order of magnitude, while having a significant
positive impact on overall throughput and scalability.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-02 17:10:24 -07:00
Jon Paul Maloy 02d11ca200 tipc: transfer broadcast nacks in link state messages
When we send broadcasts in clusters of more 70-80 nodes, we sometimes
see the broadcast link resetting because of an excessive number of
retransmissions. This is caused by a combination of two factors:

1) A 'NACK crunch", where loss of broadcast packets is discovered
   and NACK'ed by several nodes simultaneously, leading to multiple
   redundant broadcast retransmissions.

2) The fact that the NACKS as such also are sent as broadcast, leading
   to excessive load and packet loss on the transmitting switch/bridge.

This commit deals with the latter problem, by moving sending of
broadcast nacks from the dedicated BCAST_PROTOCOL/NACK message type
to regular unicast LINK_PROTOCOL/STATE messages. We allocate 10 unused
bits in word 8 of the said message for this purpose, and introduce a
new capability bit, TIPC_BCAST_STATE_NACK in order to keep the change
backwards compatible.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-02 17:10:24 -07:00
Parthasarathy Bhuvaragan d2f394dc48 tipc: fix random link resets while adding a second bearer
In a dual bearer configuration, if the second tipc link becomes
active while the first link still has pending nametable "bulk"
updates, it randomly leads to reset of the second link.

When a link is established, the function named_distribute(),
fills the skb based on node mtu (allows room for TUNNEL_PROTOCOL)
with NAME_DISTRIBUTOR message for each PUBLICATION.
However, the function named_distribute() allocates the buffer by
increasing the node mtu by INT_H_SIZE (to insert NAME_DISTRIBUTOR).
This consumes the space allocated for TUNNEL_PROTOCOL.

When establishing the second link, the link shall tunnel all the
messages in the first link queue including the "bulk" update.
As size of the NAME_DISTRIBUTOR messages while tunnelling, exceeds
the link mtu the transmission fails (-EMSGSIZE).

Thus, the synch point based on the message count of the tunnel
packets is never reached leading to link timeout.

In this commit, we adjust the size of name distributor message so that
they can be tunnelled.

Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-01 10:12:26 -07:00
David S. Miller 6abdd5f593 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
All three conflicts were cases of simple overlapping
changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-30 00:54:02 -04:00
Richard Alpe 832629ca5c tipc: add UDP remoteip dump to netlink API
When using replicast a UDP bearer can have an arbitrary amount of
remote ip addresses associated with it. This means we cannot simply
add all remote ip addresses to an existing bearer data message as it
might fill the message, leaving us with a truncated message that we
can't safely resume. To handle this we introduce the new netlink
command TIPC_NL_UDP_GET_REMOTEIP. This command is intended to be
called when the bearer data message has the
TIPC_NLA_UDP_MULTI_REMOTEIP flag set, indicating there are more than
one remote ip (replicast).

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-26 21:38:41 -07:00
Richard Alpe fdb3accc2c tipc: add the ability to get UDP options via netlink
Add UDP bearer options to netlink bearer get message. This is used by
the tipc user space tool to display UDP options.

The UDP bearer information is passed using either a sockaddr_in or
sockaddr_in6 structs. This means the user space receiver should
intermediately store the retrieved data in a large enough struct
(sockaddr_strage) before casting to the proper IP version type.

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-26 21:38:41 -07:00
Richard Alpe c9b64d492b tipc: add replicast peer discovery
Automatically learn UDP remote IP addresses of communicating peers by
looking at the source IP address of incoming TIPC link configuration
messages (neighbor discovery).

This makes configuration slightly easier and removes the problematic
scenario where a node receives directly addressed neighbor discovery
messages sent using replicast which the node cannot "reply" to using
mutlicast, leaving the link FSM in a limbo state.

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-26 21:38:41 -07:00
Richard Alpe ef20cd4dd1 tipc: introduce UDP replicast
This patch introduces UDP replicast. A concept where we emulate
multicast by sending multiple unicast messages to configured peers.

The purpose of replicast is mainly to be able to use TIPC in cloud
environments where IP multicast is disabled. Using replicas to unicast
multicast messages is costly as we have to copy each skb and send the
copies individually.

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-26 21:38:41 -07:00
Richard Alpe 1ca73e3fa1 tipc: refactor multicast ip check
Add a function to check if a tipc UDP media address is a multicast
address or not. This is a purely cosmetic change.

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-26 21:38:40 -07:00
Richard Alpe ce984da36e tipc: split UDP send function
Split the UDP send function into two. One callback that prepares the
skb and one transmit function that sends the skb. This will come in
handy in later patches, when we introduce UDP replicast.

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-26 21:38:40 -07:00
Richard Alpe ba5aa84a2d tipc: split UDP nl address parsing
Split the UDP netlink parse function so that it only parses one
netlink attribute at the time. This makes the parse function more
generic and allow future UDP API functions to use it for parsing.

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-26 21:38:40 -07:00
Wei Yongjun a5de125dd4 tipc: fix the error handling in tipc_udp_enable()
Fix to return a negative error code in enable_mcast() error handling
case, and release udp socket when necessary.

Fixes: d0f91938be ("tipc: add ip/udp media type")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-25 16:32:34 -07:00
Wei Yongjun 5128b18522 tipc: use kfree_skb() instead of kfree()
Use kfree_skb() instead of kfree() to free sk_buff.

Fixes: 0d051bf93c ("tipc: make bearer packet filtering generic")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-23 23:08:25 -07:00
Richard Alpe b34040227b tipc: add peer removal functionality
Add TIPC_NL_PEER_REMOVE netlink command. This command can remove
an offline peer node from the internal data structures.

This will be supported by the tipc user space tool in iproute2.

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18 23:36:07 -07:00
Jon Paul Maloy 5a0950c272 tipc: ensure that link congestion and wakeup use same criteria
When a link is attempted woken up after congestion, it uses a different,
more generous criteria than when it was originally declared congested.
This has the effect that the link, and the sending process, sometimes
will be woken up unnecessarily, just to immediately return to congestion
when it turns out there is not not enough space in its send queue to
host the pending message. This is a waste of CPU cycles.

We now change the function link_prepare_wakeup() to use exactly the same
criteria as tipc_link_xmit(). However, since we are now excluding the
window limit from the wakeup calculation, and the current backlog limit
for the lowest level is too small to house even a single maximum-size
message, we have to expand this limit. We do this by evaluating an
alternative, minimum value during the setting of the importance limits.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18 21:14:37 -07:00
Jon Paul Maloy 0d051bf93c tipc: make bearer packet filtering generic
In commit 5b7066c3dd ("tipc: stricter filtering of packets in bearer
layer") we introduced a method of filtering out messages while a bearer
is being reset, to avoid that links may be re-created and come back in
working state while we are still in the process of shutting them down.

This solution works well, but is limited to only work with L2 media, which
is insufficient with the increasing use of UDP as carrier media.

We now replace this solution with a more generic one, by introducing a
new flag "up" in the generic struct tipc_bearer. This field will be set
and reset at the same locations as with the previous solution, while
the packet filtering is moved to the generic code for the sending side.
On the receiving side, the filtering is still done in media specific
code, but now including the UDP bearer.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-18 21:14:36 -07:00
Vegard Nossum d2fbdf76b8 tipc: fix NULL pointer dereference in shutdown()
tipc_msg_create() can return a NULL skb and if so, we shouldn't try to
call tipc_node_xmit_skb() on it.

    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 3 PID: 30298 Comm: trinity-c0 Not tainted 4.7.0-rc7+ #19
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    task: ffff8800baf09980 ti: ffff8800595b8000 task.ti: ffff8800595b8000
    RIP: 0010:[<ffffffff830bb46b>]  [<ffffffff830bb46b>] tipc_node_xmit_skb+0x6b/0x140
    RSP: 0018:ffff8800595bfce8  EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000003023b0e0
    RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffffffff83d12580
    RBP: ffff8800595bfd78 R08: ffffed000b2b7f32 R09: 0000000000000000
    R10: fffffbfff0759725 R11: 0000000000000000 R12: 1ffff1000b2b7f9f
    R13: ffff8800595bfd58 R14: ffffffff83d12580 R15: dffffc0000000000
    FS:  00007fcdde242700(0000) GS:ffff88011af80000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fcddde1db10 CR3: 000000006874b000 CR4: 00000000000006e0
    DR0: 00007fcdde248000 DR1: 00007fcddd73d000 DR2: 00007fcdde248000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000090602
    Stack:
     0000000000000018 0000000000000018 0000000041b58ab3 ffffffff83954208
     ffffffff830bb400 ffff8800595bfd30 ffffffff8309d767 0000000000000018
     0000000000000018 ffff8800595bfd78 ffffffff8309da1a 00000000810ee611
    Call Trace:
     [<ffffffff830c84a3>] tipc_shutdown+0x553/0x880
     [<ffffffff825b4a3b>] SyS_shutdown+0x14b/0x170
     [<ffffffff8100334c>] do_syscall_64+0x19c/0x410
     [<ffffffff83295ca5>] entry_SYSCALL64_slow_path+0x25/0x25
    Code: 90 00 b4 0b 83 c7 00 f1 f1 f1 f1 4c 8d 6d e0 c7 40 04 00 00 00 f4 c7 40 08 f3 f3 f3 f3 48 89 d8 48 c1 e8 03 c7 45 b4 00 00 00 00 <80> 3c 30 00 75 78 48 8d 7b 08 49 8d 75 c0 48 b8 00 00 00 00 00
    RIP  [<ffffffff830bb46b>] tipc_node_xmit_skb+0x6b/0x140
     RSP <ffff8800595bfce8>
    ---[ end trace 57b0484e351e71f1 ]---

I feel like we should maybe return -ENOMEM or -ENOBUFS, but I'm not sure
userspace is equipped to handle that. Anyway, this is better than a GPF
and looks somewhat consistent with other tipc_msg_create() callers.

Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-15 13:55:36 -07:00
Parthasarathy Bhuvaragan 672ca65d9a tipc: fix variable dereference before NULL check
In commit cf6f7e1d51 ("tipc: dump monitor attributes"),
I dereferenced a pointer before checking if its valid.
This is reported by static check Smatch as:
net/tipc/monitor.c:733 tipc_nl_add_monitor_peer()
     warn: variable dereferenced before check 'mon' (see line 731)

In this commit, we check for a valid monitor before proceeding
with any other operation.

Fixes: cf6f7e1d51 ("tipc: dump monitor attributes")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-10 17:56:52 -07:00
Wei Yongjun 6b65bc2972 tipc: fix imbalance read_unlock_bh in __tipc_nl_add_monitor()
In the error handling case of nla_nest_start() failed read_unlock_bh()
is called  to unlock a lock that had not been taken yet. sparse warns
about the context imbalance as the following:

net/tipc/monitor.c:799:23: warning:
 context imbalance in '__tipc_nl_add_monitor' - different lock contexts for basic block

Fixes: cf6f7e1d51 ('tipc: dump monitor attributes')
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-30 20:38:22 -07:00
Parthasarathy Bhuvaragan cf6f7e1d51 tipc: dump monitor attributes
In this commit, we dump the monitor attributes when queried.
The link monitor attributes are separated into two kinds:
1. general attributes per bearer
2. specific attributes per node/peer
This style resembles the socket attributes and the nametable
publications per socket.

Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-26 14:26:42 -07:00
Parthasarathy Bhuvaragan ff0d3e78a6 tipc: add a function to get the bearer name
Introduce a new function to get the bearer name from
its id. This is used in subsequent commit.

Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-26 14:26:42 -07:00
Parthasarathy Bhuvaragan bf1035b2ff tipc: get monitor threshold for the cluster
In this commit, we add support to fetch the configured
cluster monitoring threshold.

Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-26 14:26:42 -07:00
Parthasarathy Bhuvaragan 7b3f522964 tipc: make cluster size threshold for monitoring configurable
In this commit, we introduce support to configure the minimum
threshold to activate the new link monitoring algorithm.

Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-26 14:26:42 -07:00
Parthasarathy Bhuvaragan 9ff26e9fab tipc: introduce constants for tipc address validation
In this commit, we introduce defines for tipc address size,
offset and mask specification for Zone.Cluster.Node.
There is no functional change in this commit.

Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-26 14:26:42 -07:00
David S. Miller de0ba9a0d8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Just several instances of overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-24 00:53:32 -04:00
Jon Paul Maloy 1fc07f3e15 tipc: reset all unicast links when broadcast send link fails
In test situations with many nodes and a heavily stressed system we have
observed that the transmission broadcast link may fail due to an
excessive number of retransmissions of the same packet. In such
situations we need to reset all unicast links to all peers, in order to
reset and re-synchronize the broadcast link.

In this commit, we add a new function tipc_bearer_reset_all() to be used
in such situations. The function scans across all bearers and resets all
their pertaining links.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 22:42:12 -07:00
Jon Paul Maloy a71eb72035 tipc: ensure correct broadcast send buffer release when peer is lost
After a new receiver peer has been added to the broadcast transmission
link, we allow immediate transmission of new broadcast packets, trusting
that the new peer will not accept the packets until it has received the
previously sent unicast broadcast initialiation message. In the same
way, the sender must not accept any acknowledges until it has itself
received the broadcast initialization from the peer, as well as
confirmation of the reception of its own initialization message.

Furthermore, when a receiver peer goes down, the sender has to produce
the missing acknowledges from the lost peer locally, in order ensure
correct release of the buffers that were expected to be acknowledged by
the said peer.

In a highly stressed system we have observed that contact with a peer
may come up and be lost before the above mentioned broadcast initial-
ization and confirmation have been received. This leads to the locally
produced acknowledges being rejected, and the non-acknowledged buffers
to linger in the broadcast link transmission queue until it fills up
and the link goes into permanent congestion.

In this commit, we remedy this by temporarily setting the corresponding
broadcast receive link state to ESTABLISHED and the 'bc_peer_is_up'
state to true before we issue the local acknowledges. This ensures that
those acknowledges will always be accepted. The mentioned state values
are restored immediately afterwards when the link is reset.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 22:42:12 -07:00
Jon Paul Maloy 2d18ac4ba7 tipc: extend broadcast link initialization criteria
At first contact between two nodes, an endpoint might sometimes have
time to send out a LINK_PROTOCOL/STATE packet before it has received
the broadcast initialization packet from the peer, i.e., before it has
received a valid broadcast packet number to add to the 'bc_ack' field
of the protocol message.

This means that the peer endpoint will receive a protocol packet with an
invalid broadcast acknowledge value of 0. Under unlucky circumstances
this may lead to the original, already received acknowledge value being
overwritten, so that the whole broadcast link goes stale after a while.

We fix this by delaying the setting of the link field 'bc_peer_is_up'
until we know that the peer really has received our own broadcast
initialization message. The latter is always sent out as the first
unicast message on a link, and always with seqeunce number 1. Because
of this, we only need to look for a non-zero unicast acknowledge value
in the arriving STATE messages, and once that is confirmed we know we
are safe and can set the mentioned field. Before this moment, we must
ignore all broadcast acknowledges from the peer.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11 22:42:12 -07:00
David S. Miller 30d0844bdc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/mellanox/mlx5/core/en.h
	drivers/net/ethernet/mellanox/mlx5/core/en_main.c
	drivers/net/usb/r8152.c

All three conflicts were overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-06 10:35:22 -07:00
Richard Alpe 55e77a3e82 tipc: fix nl compat regression for link statistics
Fix incorrect use of nla_strlcpy() where the first NLA_HDRLEN bytes
of the link name where left out.

Making the output of tipc-config -ls look something like:
Link statistics:
dcast-link
1:data0-1.1.2:data0
1:data0-1.1.3:data0

Also, for the record, the patch that introduce this regression
claims "Sending the whole object out can cause a leak". Which isn't
very likely as this is a compat layer, where the data we are parsing
is generated by us and we know the string to be NULL terminated. But
you can of course never be to secure.

Fixes: 5d2be1422e (tipc: fix an infoleak in tipc_nl_compat_link_dump)
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01 16:47:38 -04:00
David S. Miller ee58b57100 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several cases of overlapping changes, except the packet scheduler
conflicts which deal with the addition of the free list parameter
to qdisc_enqueue().

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-30 05:03:36 -04:00
Richard Alpe bc3a334cc2 tipc: rename udp_port in struct udp_media_addr
Context implies that port in struct "udp_media_addr" is referring
to a UDP port.

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-29 05:17:37 -04:00
Richard Alpe e99429232e tipc: honor msg2addr return value
The UDP msg2addr function tipc_udp_msg2addr() can return -EINVAL which
prior to this patch was unhanded in the caller.

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-29 05:17:37 -04:00
Amitoj Kaur Chawla 810bf11033 tipc: Use kmemdup instead of kmalloc and memcpy
Replace calls to kmalloc followed by a memcpy with a direct call to
kmemdup.

The Coccinelle semantic patch used to make this change is as follows:
@@
expression from,to,size,flag;
statement S;
@@

-  to = \(kmalloc\|kzalloc\)(size,flag);
+  to = kmemdup(from,size,flag);
   if (to==NULL || ...) S
-  memcpy(to, from, size);

Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-27 09:56:58 -04:00
Jon Paul Maloy 27777daa8b tipc: unclone unbundled buffers before forwarding
When extracting an individual message from a received "bundle" buffer,
we just create a clone of the base buffer, and adjust it to point into
the right position of the linearized data area of the latter. This works
well for regular message reception, but during periods of extremely high
load it may happen that an extracted buffer, e.g, a connection probe, is
reversed and forwarded through an external interface while the preceding
extracted message is still unhandled. When this happens, the header or
data area of the preceding message will be partially overwritten by a
MAC header, leading to unpredicatable consequences, such as a link
reset.

We now fix this by ensuring that the msg_reverse() function never
returns a cloned buffer, and that the returned buffer always contains
sufficient valid head and tail room to be forwarded.

Reported-by: Erik Hugne <erik.hugne@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-22 16:33:35 -04:00
Jon Paul Maloy f1d048f24e tipc: fix socket timer deadlock
We sometimes observe a 'deadly embrace' type deadlock occurring
between mutually connected sockets on the same node. This happens
when the one-hour peer supervision timers happen to expire
simultaneously in both sockets.

The scenario is as follows:

CPU 1:                          CPU 2:
--------                        --------
tipc_sk_timeout(sk1)            tipc_sk_timeout(sk2)
  lock(sk1.slock)                 lock(sk2.slock)
  msg_create(probe)               msg_create(probe)
  unlock(sk1.slock)               unlock(sk2.slock)
  tipc_node_xmit_skb()            tipc_node_xmit_skb()
    tipc_node_xmit()                tipc_node_xmit()
      tipc_sk_rcv(sk2)                tipc_sk_rcv(sk1)
        lock(sk2.slock)                 lock((sk1.slock)
        filter_rcv()                    filter_rcv()
          tipc_sk_proto_rcv()             tipc_sk_proto_rcv()
            msg_create(probe_rsp)           msg_create(probe_rsp)
            tipc_sk_respond()               tipc_sk_respond()
              tipc_node_xmit_skb()            tipc_node_xmit_skb()
                tipc_node_xmit()                tipc_node_xmit()
                  tipc_sk_rcv(sk1)                tipc_sk_rcv(sk2)
                    lock((sk1.slock)                lock((sk2.slock)
                    ===> DEADLOCK                   ===> DEADLOCK

Further analysis reveals that there are three different locations in the
socket code where tipc_sk_respond() is called within the context of the
socket lock, with ensuing risk of similar deadlocks.

We now solve this by passing a buffer queue along with all upcalls where
sk_lock.slock may potentially be held. Response or rejected message
buffers are accumulated into this queue instead of being sent out
directly, and only sent once we know we are safely outside the slock
context.

Reported-by: GUNA <gbalasun@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-17 21:38:10 -07:00
Dan Carpenter 0350cb48fb tipc: potential shift wrapping bug in map_set()
"up_map" is a u64 type but we're not using the high 32 bits.

Fixes: 35c55c9877 ('tipc: add neighbor monitoring framework')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-17 21:26:03 -07:00
Ying Xue c91522f860 tipc: eliminate uninitialized variable warning
net/tipc/link.c: In function ‘tipc_link_timeout’:
net/tipc/link.c:744:28: warning: ‘mtyp’ may be used uninitialized in this function [-Wuninitialized]

Fixes: 42b18f605f ("tipc: refactor function tipc_link_timeout()")
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 21:47:23 -07:00
Ying Xue 66d95b6705 tipc: fix suspicious RCU usage
When run tipcTS&tipcTC test suite, the following complaint appears:

[   56.926168] ===============================
[   56.926169] [ INFO: suspicious RCU usage. ]
[   56.926171] 4.7.0-rc1+ #160 Not tainted
[   56.926173] -------------------------------
[   56.926174] net/tipc/bearer.c:408 suspicious rcu_dereference_protected() usage!
[   56.926175]
[   56.926175] other info that might help us debug this:
[   56.926175]
[   56.926177]
[   56.926177] rcu_scheduler_active = 1, debug_locks = 1
[   56.926179] 3 locks held by swapper/4/0:
[   56.926180]  #0:  (((&req->timer))){+.-...}, at: [<ffffffff810e79b5>] call_timer_fn+0x5/0x340
[   56.926203]  #1:  (&(&req->lock)->rlock){+.-...}, at: [<ffffffffa000c29b>] disc_timeout+0x1b/0xd0 [tipc]
[   56.926212]  #2:  (rcu_read_lock){......}, at: [<ffffffffa00055e0>] tipc_bearer_xmit_skb+0xb0/0x2e0 [tipc]
[   56.926218]
[   56.926218] stack backtrace:
[   56.926221] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.7.0-rc1+ #160
[   56.926222] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[   56.926224]  0000000000000000 ffff880016803d28 ffffffff813c4423 ffff8800154252c0
[   56.926227]  0000000000000001 ffff880016803d58 ffffffff810b7512 ffff8800124d8120
[   56.926230]  ffff880013f8a160 ffff8800132b5ccc ffff8800124d8120 ffff880016803d88
[   56.926234] Call Trace:
[   56.926235]  <IRQ>  [<ffffffff813c4423>] dump_stack+0x67/0x94
[   56.926250]  [<ffffffff810b7512>] lockdep_rcu_suspicious+0xe2/0x120
[   56.926256]  [<ffffffffa00051f1>] tipc_l2_send_msg+0x131/0x1c0 [tipc]
[   56.926261]  [<ffffffffa000567c>] tipc_bearer_xmit_skb+0x14c/0x2e0 [tipc]
[   56.926266]  [<ffffffffa00055e0>] ? tipc_bearer_xmit_skb+0xb0/0x2e0 [tipc]
[   56.926273]  [<ffffffffa000c280>] ? tipc_disc_init_msg+0x1f0/0x1f0 [tipc]
[   56.926278]  [<ffffffffa000c280>] ? tipc_disc_init_msg+0x1f0/0x1f0 [tipc]
[   56.926283]  [<ffffffffa000c2d6>] disc_timeout+0x56/0xd0 [tipc]
[   56.926288]  [<ffffffff810e7a68>] call_timer_fn+0xb8/0x340
[   56.926291]  [<ffffffff810e79b5>] ? call_timer_fn+0x5/0x340
[   56.926296]  [<ffffffffa000c280>] ? tipc_disc_init_msg+0x1f0/0x1f0 [tipc]
[   56.926300]  [<ffffffff810e8f4a>] run_timer_softirq+0x23a/0x390
[   56.926306]  [<ffffffff810f89ff>] ? clockevents_program_event+0x7f/0x130
[   56.926316]  [<ffffffff819727c3>] __do_softirq+0xc3/0x4a2
[   56.926323]  [<ffffffff8106ba5a>] irq_exit+0x8a/0xb0
[   56.926327]  [<ffffffff81972456>] smp_apic_timer_interrupt+0x46/0x60
[   56.926331]  [<ffffffff81970a49>] apic_timer_interrupt+0x89/0x90
[   56.926333]  <EOI>  [<ffffffff81027fda>] ? default_idle+0x2a/0x1a0
[   56.926340]  [<ffffffff81027fd8>] ? default_idle+0x28/0x1a0
[   56.926342]  [<ffffffff810289cf>] arch_cpu_idle+0xf/0x20
[   56.926345]  [<ffffffff810adf0f>] default_idle_call+0x2f/0x50
[   56.926347]  [<ffffffff810ae145>] cpu_startup_entry+0x215/0x3e0
[   56.926353]  [<ffffffff81040ad9>] start_secondary+0xf9/0x100

The warning appears as rtnl_dereference() is wrongly used in
tipc_l2_send_msg() under RCU read lock protection. Instead the proper
usage should be that rcu_dereference_rtnl() is called here.

Fixes: 5b7066c3dd ("tipc: stricter filtering of packets in bearer layer")
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 21:47:23 -07:00
Jon Paul Maloy 35c55c9877 tipc: add neighbor monitoring framework
TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.

This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.

This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:

- Each node makes up a linearly ascending, circular list of all its N
  known neighbors, based on their TIPC node identity. This algorithm
  must be the same on all nodes.

- The node then selects the next M = sqrt(N) - 1 nodes downstream from
  itself in the list, and chooses to actively monitor those. This is
  called its "local monitoring domain".

- It creates a domain record describing the monitoring domain, and
  piggy-backs this in the data area of all neighbor monitoring messages
  (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
  the cluster eventually (default within 400 ms) will learn about
  its monitoring domain.

- Whenever a node discovers a change in its local domain, e.g., a node
  has been added or has gone down, it creates and sends out a new
  version of its node record to inform all neighbors about the change.

- A node receiving a domain record from anybody outside its local domain
  matches this against its own list (which may not look the same), and
  chooses to not actively monitor those members of the received domain
  record that are also present in its own list. Instead, it relies on
  indications from the direct monitoring nodes if an indirectly
  monitored node has gone up or down. If a node is indicated lost, the
  receiving node temporarily activates its own direct monitoring towards
  that node in order to confirm, or not, that it is actually gone.

- Since each node is actively monitoring sqrt(N) downstream neighbors,
  each node is also actively monitored by the same number of upstream
  neighbors. This means that all non-direct monitoring nodes normally
  will receive sqrt(N) indications that a node is gone.

- A major drawback with ring monitoring is how it handles failures that
  cause massive network partitionings. If both a lost node and all its
  direct monitoring neighbors are inside the lost partition, the nodes in
  the remaining partition will never receive indications about the loss.
  To overcome this, each node also chooses to actively monitor some
  nodes outside its local domain. Those nodes are called remote domain
  "heads", and are selected in such a way that no node in the cluster
  will be more than two direct monitoring hops away. Because of this,
  each node, apart from monitoring the member of its local domain, will
  also typically monitor sqrt(N) remote head nodes.

- As an optimization, local list status, domain status and domain
  records are marked with a generation number. This saves senders from
  unnecessarily conveying  unaltered domain records, and receivers from
  performing unneeded re-adaptations of their node monitoring list, such
  as re-assigning domain heads.

- As a measure of caution we have added the possibility to disable the
  new algorithm through configuration. We do this by keeping a threshold
  value for the cluster size; a cluster that grows beyond this value
  will switch from full-mesh to ring monitoring, and vice versa when
  it shrinks below the value. This means that if the threshold is set to
  a value larger than any anticipated cluster size (default size is 32)
  the new algorithm is effectively disabled. A patch set for altering the
  threshold value and for listing the table contents will follow shortly.

- This change is fully backwards compatible.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-15 14:06:28 -07:00
David S. Miller 1578b0a5e9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/sched/act_police.c
	net/sched/sch_drr.c
	net/sched/sch_hfsc.c
	net/sched/sch_prio.c
	net/sched/sch_red.c
	net/sched/sch_tbf.c

In net-next the drop methods of the packet schedulers got removed, so
the bug fixes to them in 'net' are irrelevant.

A packet action unload crash fix conflicts with the addition of the
new firstuse timestamp.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 11:52:24 -07:00
Jon Paul Maloy 5ca509fc0b tipc: change node timer unit from jiffies to ms
The node keepalive interval is recalculated at each timer expiration
to catch any changes in the link tolerance, and stored in a field in
struct tipc_node. We use jiffies as unit for the stored value.

This is suboptimal, because it makes the calculation unnecessary
complex, including two unit conversions. The conversions also lead to
a rounding error that causes the link "abort limit" to be 3 in the
normal case, instead of 4, as intended. This again leads to unnecessary
link resets when the network is pushed close to its limit, e.g., in an
environment with hundreds of nodes or namesapces.

In this commit, we do instead let the keepalive value be calculated and
stored in milliseconds, so that there is only one conversion and the
rounding error is eliminated.

We also remove a redundant "keepalive" field in struct tipc_link. This
is remnant from the previous implementation.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08 11:27:02 -07:00
Jon Paul Maloy c4282ca76c tipc: correct error in node fsm
commit 88e8ac7000 ("tipc: reduce transmission rate of reset messages
when link is down") revealed a flaw in the node FSM, as defined in
the log of commit 66996b6c47 ("tipc: extend node FSM").

We see the following scenario:
1: Node B receives a RESET message from node A before its link endpoint
   is fully up, i.e., the node FSM is in state SELF_UP_PEER_COMING. This
   event will not change the node FSM state, but the (distinct) link FSM
   will move to state RESETTING.
2: As an effect of the previous event, the local endpoint on B will
   declare node A lost, and post the event SELF_DOWN to the its node
   FSM. This moves the FSM state to SELF_DOWN_PEER_LEAVING, meaning
   that no messages will be accepted from A until it receives another
   RESET message that confirms that A's endpoint has been reset. This
   is  wasteful, since we know this as a fact already from the first
   received RESET, but worse is that the link instance's FSM has not
   wasted this information, but instead moved on to state ESTABLISHING,
   meaning that it repeatedly sends out ACTIVATE messages to the reset
   peer A.
3: Node A will receive one of the ACTIVATE messages, move its link FSM
   to state ESTABLISHED, and start repeatedly sending out STATE messages
   to node B.
4: Node B will consistently drop these messages, since it can only accept
   accept a RESET according to its node FSM.
5: After four lost STATE messages node A will reset its link and start
   repeatedly sending out RESET messages to B.
6: Because of the reduced send rate for RESET messages, it is very
   likely that A will receive an ACTIVATE (which is sent out at a much
   higher frequency) before it gets the chance to send a RESET, and A
   may hence quickly move back to state ESTABLISHED and continue sending
   out STATE messages, which will again be dropped by B.
7: GOTO 5.
8: After having repeated the cycle 5-7 a number of times, node A will
   by chance get in between with sending a RESET, and the situation is
   resolved.

Unfortunately, we have seen that it may take a substantial amount of
time before this vicious loop is broken, sometimes in the order of
minutes.

We correct this by making a small correction to the node FSM: When a
node in state SELF_UP_PEER_COMING receives a SELF_DOWN event, it now
moves directly back to state SELF_DOWN_PEER_DOWN, instead of as now
SELF_DOWN_PEER_LEAVING. This is logically consistent, since we don't
need to wait for RESET confirmation from of an endpoint that we alread
know has been reset. It also means that node B in the scenario above
will not be dropping incoming STATE messages, and the link can come up
immediately.

Finally, a symmetry comparison reveals that the  FSM has a similar
error when receiving the event PEER_DOWN in state PEER_UP_SELF_COMING.
Instead of moving to PERR_DOWN_SELF_LEAVING, it should move directly
to SELF_DOWN_PEER_DOWN. Although we have never seen any negative effect
of this logical error, we choose fix this one, too.

The node FSM looks as follows after those changes:

                           +----------------------------------------+
                           |                           PEER_DOWN_EVT|
                           |                                        |
  +------------------------+----------------+                       |
  |SELF_DOWN_EVT           |                |                       |
  |                        |                |                       |
  |              +-----------+          +-----------+               |
  |              |NODE_      |          |NODE_      |               |
  |   +----------|FAILINGOVER|<---------|SYNCHING   |-----------+   |
  |   |SELF_     +-----------+ FAILOVER_+-----------+   PEER_   |   |
  |   |DOWN_EVT   |          A BEGIN_EVT  A         |   DOWN_EVT|   |
  |   |           |          |            |         |           |   |
  |   |           |          |            |         |           |   |
  |   |           |FAILOVER_ |FAILOVER_   |SYNCH_   |SYNCH_     |   |
  |   |           |END_EVT   |BEGIN_EVT   |BEGIN_EVT|END_EVT    |   |
  |   |           |          |            |         |           |   |
  |   |           |          |            |         |           |   |
  |   |           |         +--------------+        |           |   |
  |   |           +-------->|   SELF_UP_   |<-------+           |   |
  |   |   +-----------------|   PEER_UP    |----------------+   |   |
  |   |   |SELF_DOWN_EVT    +--------------+   PEER_DOWN_EVT|   |   |
  |   |   |                    A        A                   |   |   |
  |   |   |                    |        |                   |   |   |
  |   |   |         PEER_UP_EVT|        |SELF_UP_EVT        |   |   |
  |   |   |                    |        |                   |   |   |
  V   V   V                    |        |                   V   V   V
+------------+       +-----------+    +-----------+       +------------+
|SELF_DOWN_  |       |SELF_UP_   |    |PEER_UP_   |       |PEER_DOWN   |
|PEER_LEAVING|       |PEER_COMING|    |SELF_COMING|       |SELF_LEAVING|
+------------+       +-----------+    +-----------+       +------------+
       |               |       A        A       |                |
       |               |       |        |       |                |
       |       SELF_   |       |SELF_   |PEER_  |PEER_           |
       |       DOWN_EVT|       |UP_EVT  |UP_EVT |DOWN_EVT        |
       |               |       |        |       |                |
       |               |       |        |       |                |
       |               |    +--------------+    |                |
       |PEER_DOWN_EVT  +--->|  SELF_DOWN_  |<---+   SELF_DOWN_EVT|
       +------------------->|  PEER_DOWN   |<--------------------+
                            +--------------+

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08 11:27:01 -07:00
Kangjie Lu 5d2be1422e tipc: fix an infoleak in tipc_nl_compat_link_dump
link_info.str is a char array of size 60. Memory after the NULL
byte is not initialized. Sending the whole object out can cause
a leak.

Signed-off-by: Kangjie Lu <kjlu@gatech.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-02 21:32:37 -07:00
Baozeng Ding 297f7d2cce tipc: fix potential null pointer dereferences in some compat functions
Before calling the nla_parse_nested function, make sure the pointer to the
attribute is not null. This patch fixes several potential null pointer
dereference vulnerabilities in the tipc netlink functions.

Signed-off-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-25 12:33:52 -07:00
Eric Dumazet b91083a45e tipc: block BH in TCP callbacks
TCP stack can now run from process context.

Use read_lock_bh(&sk->sk_callback_lock) variant to restore previous
assumption.

Fixes: 5413d1babe ("net: do not block BH while processing socket backlog")
Fixes: d41a69f1d3 ("tcp: make tcp_sendmsg() aware of socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-19 11:36:49 -07:00
Linus Torvalds 16bf834805 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (21 commits)
  gitignore: fix wording
  mfd: ab8500-debugfs: fix "between" in printk
  memstick: trivial fix of spelling mistake on management
  cpupowerutils: bench: fix "average"
  treewide: Fix typos in printk
  IB/mlx4: printk fix
  pinctrl: sirf/atlas7: fix printk spelling
  serial: mctrl_gpio: Grammar s/lines GPIOs/line GPIOs/, /sets/set/
  w1: comment spelling s/minmum/minimum/
  Blackfin: comment spelling s/divsor/divisor/
  metag: Fix misspellings in comments.
  ia64: Fix misspellings in comments.
  hexagon: Fix misspellings in comments.
  tools/perf: Fix misspellings in comments.
  cris: Fix misspellings in comments.
  c6x: Fix misspellings in comments.
  blackfin: Fix misspelling of 'register' in comment.
  avr32: Fix misspelling of 'definitions' in comment.
  treewide: Fix typos in printk
  Doc: treewide : Fix typos in DocBook/filesystem.xml
  ...
2016-05-17 17:05:30 -07:00
Richard Alpe 03aaaa9b94 tipc: fix nametable publication field in nl compat
The publication field of the old netlink API should contain the
publication key and not the publication reference.

Fixes: 44a8ae94fd (tipc: convert legacy nl name table dump to nl compat)
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-17 12:34:02 -04:00
Richard Alpe 45e093ae28 tipc: check nl sock before parsing nested attributes
Make sure the socket for which the user is listing publication exists
before parsing the socket netlink attributes.

Prior to this patch a call without any socket caused a NULL pointer
dereference in tipc_nl_publ_dump().

Tested-and-reported-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.cm>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-16 21:58:54 -04:00
Jon Paul Maloy e7142c341c tipc: eliminate risk of double link_up events
When an ACTIVATE or data packet is received in a link in state
ESTABLISHING, the link does not immediately change state to
ESTABLISHED, but does instead return a LINK_UP event to the caller,
which will execute the state change in a different lock context.

This non-atomic approach incurs a low risk that we may have two
LINK_UP events pending simultaneously for the same link, resulting
in the final part of the setup procedure being executed twice. The
only potential harm caused by this it that we may see two LINK_UP
events issued to subsribers of the topology server, something that
may cause confusion.

This commit eliminates this risk by checking if the link is already
up before proceeding with the second half of the setup.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-12 17:11:27 -04:00
David S. Miller cba6532100 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/ipv4/ip_gre.c

Minor conflicts between tunnel bug fixes in net and
ipv6 tunnel cleanups in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 00:52:29 -04:00
Jon Paul Maloy 10724cc7bb tipc: redesign connection-level flow control
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.

Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.

A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.

This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.

This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.

In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 15:51:16 -04:00
Jon Paul Maloy 60020e1857 tipc: propagate peer node capabilities to socket layer
During neighbor discovery, nodes advertise their capabilities as a bit
map in a dedicated 16-bit field in the discovery message header. This
bit map has so far only be stored in the node structure on the peer
nodes, but we now see the need to keep a copy even in the socket
structure.

This commit adds this functionality.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 15:51:15 -04:00
Jon Paul Maloy 7c8bcfb125 tipc: re-enable compensation for socket receive buffer double counting
In the refactoring commit d570d86497 ("tipc: enqueue arrived buffers
in socket in separate function") we did by accident replace the test

if (sk->sk_backlog.len == 0)
     atomic_set(&tsk->dupl_rcvcnt, 0);

with

if (sk->sk_backlog.len)
     atomic_set(&tsk->dupl_rcvcnt, 0);

This effectively disables the compensation we have for the double
receive buffer accounting that occurs temporarily when buffers are
moved from the backlog to the socket receive queue. Until now, this
has gone unnoticed because of the large receive buffer limits we are
applying, but becomes indispensable when we reduce this buffer limit
later in this series.

We now fix this by inverting the mentioned condition.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 15:51:14 -04:00
Hamish Martin efe790502b tipc: only process unicast on intended node
We have observed complete lock up of broadcast-link transmission due to
unacknowledged packets never being removed from the 'transmq' queue. This
is traced to nodes having their ack field set beyond the sequence number
of packets that have actually been transmitted to them.
Consider an example where node 1 has sent 10 packets to node 2 on a
link and node 3 has sent 20 packets to node 2 on another link. We
see examples of an ack from node 2 destined for node 3 being treated as
an ack from node 2 at node 1. This leads to the ack on the node 1 to node
2 link being increased to 20 even though we have only sent 10 packets.
When node 1 does get around to sending further packets, none of the
packets with sequence numbers less than 21 are actually removed from the
transmq.
To resolve this we reinstate some code lost in commit d999297c3d ("tipc:
reduce locking scope during packet reception") which ensures that only
messages destined for the receiving node are processed by that node. This
prevents the sequence numbers from getting out of sync and resolves the
packet leakage, thereby resolving the broadcast-link transmission
lock-ups we observed.

While we are aware that this change only patches over a root problem that
we still haven't identified, this is a sanity test that it is always
legitimate to do. It will remain in the code even after we identify and
fix the real problem.

Reviewed-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Reviewed-by: John Thompson <john.thompson@alliedtelesis.co.nz>
Signed-off-by: Hamish Martin <hamish.martin@alliedtelesis.co.nz>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-01 21:03:30 -04:00
Jon Paul Maloy def22c47d7 tipc: set 'active' state correctly for first established link
When we are displaying statistics for the first link established between
two peers, it will always be presented as STANDBY although it in reality
is ACTIVE.

This happens because we forget to set the 'active' flag in the link
instance at the moment it is established. Although this is a bug, it only
has impact on the presentation view of the link, not on its actual
functionality.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-01 19:40:22 -04:00
Dan Carpenter b43586576e tipc: remove an unnecessary NULL check
This is never called with a NULL "buf" and anyway, we dereference 's' on
the lines before so it would Oops before we reach the check.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 16:54:12 -04:00
Parthasarathy Bhuvaragan 8cee83dd29 tipc: fix stale links after re-enabling bearer
Commit 42b18f605f ("tipc: refactor function tipc_link_timeout()"),
introduced a bug which prevents sending of probe messages during
link synchronization phase. This leads to hanging links, if the
bearer is disabled/enabled after links are up.

In this commit, we send the probe messages correctly.

Fixes: 42b18f605f ("tipc: refactor function tipc_link_timeout()")
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-24 14:35:07 -04:00
David S. Miller 1602f49b58 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts were two cases of simple overlapping changes,
nothing serious.

In the UDP case, we need to add a hlist_add_tail_rcu()
to linux/rculist.h, because we've moved UDP socket handling
away from using nulls lists.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23 18:51:33 -04:00
Masanari Iida c19ca6cb4c treewide: Fix typos in printk
This patch fix spelling typos found in printk
within various part of the kernel sources.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-04-18 11:23:24 +02:00
Jon Paul Maloy 34b9cd64c8 tipc: let first message on link be a state message
According to the link FSM, a received traffic packet can take a link
from state ESTABLISHING to ESTABLISHED, but the link can still not be
fully set up in one atomic operation. This means that even if the the
very first packet on the link is a traffic packet with sequence number
1 (one), it has to be dropped and retransmitted.

This can be avoided if we let the mentioned packet be preceded by a
LINK_PROTOCOL/STATE message, which takes up the endpoint before the
arrival of the traffic.

We add this small feature in this commit.

This is a fully compatible change.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 16:09:06 -04:00
Jon Paul Maloy de7e07f9ee tipc: ensure that first packets on link are sent in order
In some link establishment scenarios we see that packet #2 may be sent
out before packet #1, forcing the receiver to demand retransmission of
the missing packet. This is harmless, but may cause confusion among
people tracing the packet flow.

Since this is extremely easy to fix, we do so by adding en extra send
call to the bearer immediately after the link has come up.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 16:09:06 -04:00
Jon Paul Maloy 42b18f605f tipc: refactor function tipc_link_timeout()
The function tipc_link_timeout() is unnecessary complex, and can
easily be made more readable.

We do that with this commit. The only functional change is that we
remove a redundant test for whether the broadcast link is up or not.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 16:09:06 -04:00
Jon Paul Maloy 88e8ac7000 tipc: reduce transmission rate of reset messages when link is down
When a link is down, it will continuously try to re-establish contact
with the peer by sending out a RESET or an ACTIVATE message at each
timeout interval. The default value for this interval is currently
375 ms. This is wasteful, and may become a problem in very large
clusters with dozens or hundreds of nodes being down simultaneously.

We now introduce a simple backoff algorithm for these cases. The
first five messages are sent at default rate; thereafter a message
is sent only each 16th timer interval.

This will cover the vast majority of link recycling cases, since the
endpoint starting last will transmit at the higher speed, and the link
should normally be established well be before the rate needs to be
reduced.

The only case where we will see a degradation of link re-establishment
times is when the endpoints remain intact, and a glitch in the
transmission media is causing the link reset. We will then experience
a worst-case re-establishing time of 6 seconds, something we deem
acceptable.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 16:09:05 -04:00
Jon Paul Maloy 634696b197 tipc: guarantee peer bearer id exchange after reboot
When a link endpoint is going down locally, e.g., because its interface
is being stopped, it will spontaneously send out a RESET message to
its peer, informing it about this fact. This saves the peer from
detecting the failure via probing, and hence gives both speedier and
less resource consuming failure detection on the peer side.

According to the link FSM, a receiver of a RESET message, ignoring the
reason for it, must now consider the sender ready to come back up, and
starts periodically sending out ACTIVATE messages to the peer in order
to re-establish the link. Also, according to the FSM, the receiver of
an ACTIVATE message can now go directly to state ESTABLISHED and start
sending regular traffic packets. This is a well-proven and robust FSM.

However, in the case of a reboot, there is a small possibilty that link
endpoint on the rebooted node may have been re-created with a new bearer
identity between the moment it sent its (pre-boot) RESET and the moment
it receives the ACTIVATE from the peer. The new bearer identity cannot
be known by the peer according to this scenario, since traffic headers
don't convey such information. This is a problem, because both endpoints
need to know the correct value of the peer's bearer id at any moment in
time in order to be able to produce correct link events for their users.

The only way to guarantee this is to enforce a full setup message
exchange (RESET + ACTIVATE) even after the reboot, since those messages
carry the bearer idientity in their header.

In this commit we do this by introducing and setting a "stopping" bit in
the header of the spontaneously generated RESET messages, informing the
peer that the sender will not be immediately ready to re-establish the
link. A receiver seeing this bit must act as if this were a locally
detected connectivity failure, and hence has to go through a full two-
way setup message exchange before any link can be re-established.

Although never reported, this problem seems to have always been around.

This protocol addition is fully backwards compatible.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-15 16:09:05 -04:00
Parthasarathy Bhuvaragan 333f796235 tipc: fix a race condition leading to subscriber refcnt bug
Until now, the requests sent to topology server are queued
to a workqueue by the generic server framework.
These messages are processed by worker threads and trigger the
registered callbacks.
To reduce latency on uniprocessor systems, explicit rescheduling
is performed using cond_resched() after MAX_RECV_MSG_COUNT(25)
messages.

This implementation on SMP systems leads to an subscriber refcnt
error as described below:
When a worker thread yields by calling cond_resched() in a SMP
system, a new worker is created on another CPU to process the
pending workitem. Sometimes the sleeping thread wakes up before
the new thread finishes execution.
This breaks the assumption on ordering and being single threaded.
The fault is more frequent when MAX_RECV_MSG_COUNT is lowered.

If the first thread was processing subscription create and the
second thread processing close(), the close request will free
the subscriber and the create request oops as follows:

[31.224137] WARNING: CPU: 2 PID: 266 at include/linux/kref.h:46 tipc_subscrb_rcv_cb+0x317/0x380         [tipc]
[31.228143] CPU: 2 PID: 266 Comm: kworker/u8:1 Not tainted 4.5.0+ #97
[31.228377] Workqueue: tipc_rcv tipc_recv_work [tipc]
[...]
[31.228377] Call Trace:
[31.228377]  [<ffffffff812fbb6b>] dump_stack+0x4d/0x72
[31.228377]  [<ffffffff8105a311>] __warn+0xd1/0xf0
[31.228377]  [<ffffffff8105a3fd>] warn_slowpath_null+0x1d/0x20
[31.228377]  [<ffffffffa0098067>] tipc_subscrb_rcv_cb+0x317/0x380 [tipc]
[31.228377]  [<ffffffffa00a4984>] tipc_receive_from_sock+0xd4/0x130 [tipc]
[31.228377]  [<ffffffffa00a439b>] tipc_recv_work+0x2b/0x50 [tipc]
[31.228377]  [<ffffffff81071925>] process_one_work+0x145/0x3d0
[31.246554] ---[ end trace c3882c9baa05a4fd ]---
[31.248327] BUG: spinlock bad magic on CPU#2, kworker/u8:1/266
[31.249119] BUG: unable to handle kernel NULL pointer dereference at 0000000000000428
[31.249323] IP: [<ffffffff81099d0c>] spin_dump+0x5c/0xe0
[31.249323] PGD 0
[31.249323] Oops: 0000 [#1] SMP

In this commit, we
- rename tipc_conn_shutdown() to tipc_conn_release().
- move connection release callback execution from tipc_close_conn()
  to a new function tipc_sock_release(), which is executed before
  we free the connection.
Thus we release the subscriber during connection release procedure
rather than connection shutdown procedure.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 16:46:46 -04:00
Jon Paul Maloy 7d45a04cbc tipc: remove remnants of old broadcast code
We remove a couple of leftover fields in struct tipc_bearer. Those
were used by the old broadcast implementation, and are not needed
any longer. There is no functional changes in this commit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13 17:49:11 -04:00
Erik Hugne ddb1d33969 tipc: purge deferred updates from dead nodes
If a peer node becomes unavailable, in addition to removing the
nametable entries from this node we also need to purge all deferred
updates associated with this node.

Signed-off-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11 15:22:20 -04:00
Erik Hugne 541726abe7 tipc: make dist queue pernet
Nametable updates received from the network that cannot be applied
immediately are placed on a defer queue. This queue is global to the
TIPC module, which might cause problems when using TIPC in containers.
To prevent nametable updates from escaping into the wrong namespace,
we make the queue pernet instead.

Signed-off-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-11 15:22:20 -04:00
Jon Paul Maloy 5b7066c3dd tipc: stricter filtering of packets in bearer layer
Resetting a bearer/interface, with the consequence of resetting all its
pertaining links, is not an atomic action. This becomes particularly
evident in very large clusters, where a lot of traffic may happen on the
remaining links while we are busy shutting them down. In extreme cases,
we may even see links being re-created and re-established before we are
finished with the job.

To solve this, we now introduce a solution where we temporarily detach
the bearer from the interface when the bearer is reset. This inhibits
all packet reception, while sending still is possible. For the latter,
we use the fact that the device's user pointer now is zero to filter out
which packets can be sent during this situation; i.e., outgoing RESET
messages only.  This filtering serves to speed up the neighbors'
detection of the loss event, and saves us from unnecessary probing.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 17:00:13 -04:00
Jon Paul Maloy 4e801fa14f tipc: eliminate buffer leak in bearer layer
When enabling a bearer we create a 'neigbor discoverer' instance by
calling the function tipc_disc_create() before the bearer is actually
registered in the list of enabled bearers. Because of this, the very
first discovery broadcast message, created by the mentioned function,
is lost, since it cannot find any valid bearer to use. Furthermore,
the used send function, tipc_bearer_xmit_skb() does not free the given
buffer when it cannot find a  bearer, resulting in the leak of exactly
one send buffer each time a bearer is enabled.

This commit fixes this problem by introducing two changes:

1) Instead of attemting to send the discovery message directly, we let
   tipc_disc_create() return the discovery buffer to the calling
   function, tipc_enable_bearer(), so that the latter can send it
   when the enabling sequence is finished.

2) In tipc_bearer_xmit_skb(), as well as in the two other transmit
   functions at the bearer layer, we now free the indicated buffer or
   buffer chain when a valid bearer cannot be found.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 17:00:13 -04:00
Richard Alpe 9bd160bfa2 tipc: make sure IPv6 header fits in skb headroom
Expand headroom further in order to be able to fit the larger IPv6
header. Prior to this patch this caused a skb under panic for certain
tipc packets when using IPv6 UDP bearer(s).

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-14 12:23:12 -04:00
Daniel Borkmann 134611446d ip_tunnel: add support for setting flow label via collect metadata
This patch extends udp_tunnel6_xmit_skb() to pass in the IPv6 flow label
from call sites. Currently, there's no such option and it's always set to
zero when writing ip6_flow_hdr(). Add a label member to ip_tunnel_key, so
that flow-based tunnels via collect metadata frontends can make use of it.
vxlan and geneve will be converted to add flow label support separately.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-11 15:14:26 -05:00
David S. Miller 810813c47a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several cases of overlapping changes, as well as one instance
(vxlan) of a bug fix in 'net' overlapping with code movement
in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 12:34:12 -05:00
Richard Alpe 49cc66eaee tipc: move netlink policies to netlink.c
Make the c files less cluttered and enable netlink attributes to be
shared between files.

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-07 14:56:41 -05:00
Jon Paul Maloy e74a386d70 tipc: remove pre-allocated message header in link struct
Until now, we have kept a pre-allocated protocol message header
aggregated into struct tipc_link. Apart from adding unnecessary
footprint to the link instances, this requires extra code both to
initialize and re-initialize it.

We now remove this sub-optimization. This change also makes it
possible to clean up the function tipc_build_proto_msg() and remove
a couple of small functions that were accessing the mentioned header.
In particular, we can replace all occurrences of the local function
call link_own_addr(link) with the generic tipc_own_addr(net).

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06 23:01:20 -05:00
Parthasarathy Bhuvaragan 4de13d7ed6 tipc: fix nullptr crash during subscription cancel
commit 4d5cfcba2f ('tipc: fix connection abort during subscription
cancel'), removes the check for a valid subscription before calling
tipc_nametbl_subscribe().

This will lead to a nullptr exception when we process a
subscription cancel request. For a cancel request, a null
subscription is passed to tipc_nametbl_subscribe() resulting
in exception.

In this commit, we call tipc_nametbl_subscribe() only for
a valid subscription.

Fixes: 4d5cfcba2f ('tipc: fix connection abort during subscription cancel')
Reported-by: Anders Widell <anders.widell@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06 23:00:08 -05:00
Richard Alpe 34f65dbb6c tipc: make sure required IPv6 addresses are scoped
Make sure the user has provided a scope for multicast and link local
addresses used locally by a UDP bearer.

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06 22:54:57 -05:00
Richard Alpe ddb3712552 tipc: safely copy UDP netlink data from user
The netlink policy for TIPC_NLA_UDP_LOCAL and TIPC_NLA_UDP_REMOTE
is of type binary with a defined length. This causes the policy
framework to threat the defined length as maximum length.

There is however no protection against a user sending a smaller
amount of data. Prior to this patch this wasn't handled which could
result in a partially incomplete sockaddr_storage struct containing
uninitialized data.

In this patch we use nla_memcpy() when copying the user data. This
ensures a potential gap at the end is cleared out properly.

This was found by Julia with Coccinelle tool.

Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06 22:54:57 -05:00
Richard Alpe 2837f39c7c tipc: don't check link reset on non existing link
Make sure we have a link before checking if it has been reset or not.

Prior to this patch tipc_link_is_reset() could be called with a non
existing link, resulting in a null pointer dereference.

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06 22:54:56 -05:00
Richard Alpe 9b3009604b tipc: add net device to skb before UDP xmit
Prior to this patch enabling a IPv4 UDP bearer caused a null pointer
dereference in iptunnel_xmit_stats(), when it tried to dereference the
net device from the skb. To resolve this we now point the skb device
to the net device resolved from the routing table.

Fixes: 039f50629b (ip_tunnel: Move stats update to iptunnel_xmit())
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-06 22:54:56 -05:00
Parthasarathy Bhuvaragan f214fc4029 tipc: Revert "tipc: use existing sk_write_queue for outgoing packet chain"
reverts commit 94153e36e7 ("tipc: use existing sk_write_queue for
outgoing packet chain")

In Commit 94153e36e7, we assume that we fill & empty the socket's
sk_write_queue within the same lock_sock() session.

This is not true if the link is congested. During congestion, the
socket lock is released while we wait for the congestion to cease.
This implementation causes a nullptr exception, if the user space
program has several threads accessing the same socket descriptor.

Consider two threads of the same program performing the following:
     Thread1                                  Thread2
--------------------                    ----------------------
Enter tipc_sendmsg()                    Enter tipc_sendmsg()
lock_sock()                             lock_sock()
Enter tipc_link_xmit(), ret=ELINKCONG   spin on socket lock..
sk_wait_event()                             :
release_sock()                          grab socket lock
    :                                   Enter tipc_link_xmit(), ret=0
    :                                   release_sock()
Wakeup after congestion
lock_sock()
skb = skb_peek(pktchain);
!! TIPC_SKB_CB(skb)->wakeup_pending = tsk->link_cong;

In this case, the second thread transmits the buffers belonging to
both thread1 and thread2 successfully. When the first thread wakeup
after the congestion it assumes that the pktchain is intact and
operates on the skb's in it, which leads to the following exception:

[2102.439969] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
[2102.440074] IP: [<ffffffffa005f330>] __tipc_link_xmit+0x2b0/0x4d0 [tipc]
[2102.440074] PGD 3fa3f067 PUD 3fa6b067 PMD 0
[2102.440074] Oops: 0000 [#1] SMP
[2102.440074] CPU: 2 PID: 244 Comm: sender Not tainted 3.12.28 #1
[2102.440074] RIP: 0010:[<ffffffffa005f330>]  [<ffffffffa005f330>] __tipc_link_xmit+0x2b0/0x4d0 [tipc]
[...]
[2102.440074] Call Trace:
[2102.440074]  [<ffffffff8163f0b9>] ? schedule+0x29/0x70
[2102.440074]  [<ffffffffa006a756>] ? tipc_node_unlock+0x46/0x170 [tipc]
[2102.440074]  [<ffffffffa005f761>] tipc_link_xmit+0x51/0xf0 [tipc]
[2102.440074]  [<ffffffffa006d8ae>] tipc_send_stream+0x11e/0x4f0 [tipc]
[2102.440074]  [<ffffffff8106b150>] ? __wake_up_sync+0x20/0x20
[2102.440074]  [<ffffffffa006dc9c>] tipc_send_packet+0x1c/0x20 [tipc]
[2102.440074]  [<ffffffff81502478>] sock_sendmsg+0xa8/0xd0
[2102.440074]  [<ffffffff81507895>] ? release_sock+0x145/0x170
[2102.440074]  [<ffffffff815030d8>] ___sys_sendmsg+0x3d8/0x3e0
[2102.440074]  [<ffffffff816426ae>] ? _raw_spin_unlock+0xe/0x10
[2102.440074]  [<ffffffff81115c2a>] ? handle_mm_fault+0x6ca/0x9d0
[2102.440074]  [<ffffffff8107dd65>] ? set_next_entity+0x85/0xa0
[2102.440074]  [<ffffffff816426de>] ? _raw_spin_unlock_irq+0xe/0x20
[2102.440074]  [<ffffffff8107463c>] ? finish_task_switch+0x5c/0xc0
[2102.440074]  [<ffffffff8163ea8c>] ? __schedule+0x34c/0x950
[2102.440074]  [<ffffffff81504e12>] __sys_sendmsg+0x42/0x80
[2102.440074]  [<ffffffff81504e62>] SyS_sendmsg+0x12/0x20
[2102.440074]  [<ffffffff8164aed2>] system_call_fastpath+0x16/0x1b

In this commit, we maintain the skb list always in the stack.

Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-03 16:30:29 -05:00
Florian Westphal 619b17452a tipc: fix null deref crash in compat config path
msg.dst_sk needs to be set up with a valid socket because some callbacks
later derive the netns from it.

Fixes: 263ea09084d172d ("Revert "genl: Add genlmsg_new_unicast() for unicast message allocation")
Reported-by: Jon Maloy <maloy@donjonn.com>
Bisected-by: Jon Maloy <maloy@donjonn.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-25 17:04:48 -05:00
Jon Paul Maloy d25a01257e tipc: fix crash during node removal
When the TIPC module is unloaded, we have identified a race condition
that allows a node reference counter to go to zero and the node instance
being freed before the node timer is finished with accessing it. This
leads to occasional crashes, especially in multi-namespace environments.

The scenario goes as follows:

CPU0:(node_stop)                       CPU1:(node_timeout)  // ref == 2

1:                                          if(!mod_timer())
2: if (del_timer())
3:   tipc_node_put()                                        // ref -> 1
4: tipc_node_put()                                          // ref -> 0
5:   kfree_rcu(node);
6:                                               tipc_node_get(node)
7:                                               // BOOM!

We now clean up this functionality as follows:

1) We remove the node pointer from the node lookup table before we
   attempt deactivating the timer. This way, we reduce the risk that
   tipc_node_find() may obtain a valid pointer to an instance marked
   for deletion; a harmless but undesirable situation.

2) We use del_timer_sync() instead of del_timer() to safely deactivate
   the node timer without any risk that it might be reactivated by the
   timeout handler. There is no risk of deadlock here, since the two
   functions never touch the same spinlocks.

3: We remove a pointless tipc_node_get() + tipc_node_put() from the
   timeout handler.

Reported-by: Zhijiang Hu <huzhijiang@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-25 17:04:48 -05:00
Jon Paul Maloy b170997ace tipc: eliminate risk of finding to-be-deleted node instance
Although we have never seen it happen, we have identified the
following problematic scenario when nodes are stopped and deleted:

CPU0:                            CPU1:

tipc_node_xxx()                                   //ref == 1
   tipc_node_put()                                //ref -> 0
                                 tipc_node_find() // node still in table
       tipc_node_delete()
         list_del_rcu(n. list)
                                 tipc_node_get()  //ref -> 1, bad
         kfree_rcu()

                                 tipc_node_put() //ref to 0 again.
                                 kfree_rcu()     // BOOM!

We fix this by introducing use of the conditional kref_get_if_not_zero()
instead of kref_get() in the function tipc_node_find(). This eliminates
any risk of post-mortem access.

Reported-by: Zhijiang Hu <huzhijiang@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-25 17:04:48 -05:00
David S. Miller b633353115 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/phy/bcm7xxx.c
	drivers/net/phy/marvell.c
	drivers/net/vxlan.c

All three conflicts were cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-23 00:09:14 -05:00
Insu Yun b53ce3e7d4 tipc: unlock in error path
tipc_bcast_unlock need to be unlocked in error path.

Signed-off-by: Insu Yun <wuninsu@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-19 15:38:44 -05:00
Florian Westphal 263ea09084 Revert "genl: Add genlmsg_new_unicast() for unicast message allocation"
This reverts commit bb9b18fb55 ("genl: Add genlmsg_new_unicast() for
unicast message allocation")'.

Nothing wrong with it; its no longer needed since this was only for
mmapped netlink support.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 11:42:19 -05:00
Richard Alpe 4952cd3e7b tipc: refactor node xmit and fix memory leaks
Refactor tipc_node_xmit() to fail fast and fail early. Fix several
potential memory leaks in unexpected error paths.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 15:58:40 -05:00
Jon Paul Maloy d5c91fb72f tipc: fix premature addition of node to lookup table
In commit 5266698661 ("tipc: let broadcast packet reception
use new link receive function") we introduced a new per-node
broadcast reception link instance. This link is created at the
moment the node itself is created. Unfortunately, the allocation
is done after the node instance has already been added to the node
lookup hash table. This creates a potential race condition, where
arriving broadcast packets are able to find and access the node
before it has been fully initialized, and before the above mentioned
link has been created. The result is occasional crashes in the function
tipc_bcast_rcv(), which is trying to access the not-yet existing link.

We fix this by deferring the addition of the node instance until after
it has been fully initialized in the function tipc_node_create().

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-16 15:57:11 -05:00
Parthasarathy Bhuvaragan 06c8581f85 tipc: use alloc_ordered_workqueue() instead of WQ_UNBOUND w/ max_active = 1
Until now, tipc_rcv and tipc_send workqueues in server are allocated
with parameters WQ_UNBOUND & max_active = 1.
This parameters passed to this function makes it equivalent to
alloc_ordered_workqueue(). The later form is more explicit and
can inherit future ordered_workqueue changes.

In this commit we replace alloc_workqueue() with more readable
alloc_ordered_workqueue().

Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06 03:41:58 -05:00
Parthasarathy Bhuvaragan ae245557f8 tipc: donot create timers if subscription timeout = TIPC_WAIT_FOREVER
Until now, we create timers even for the subscription requests
with timeout = TIPC_WAIT_FOREVER.
This can be improved by avoiding timer creation when the timeout
is set to TIPC_WAIT_FOREVER.

In this commit, we introduce a check to creates timers only
when timeout != TIPC_WAIT_FOREVER.

Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06 03:41:58 -05:00
Parthasarathy Bhuvaragan f3ad288c56 tipc: protect tipc_subscrb_get() with subscriber spin lock
Until now, during subscription creation the mod_time() &
tipc_subscrb_get() are called after releasing the subscriber
spin lock.

In a SMP system when performing a subscription creation, if the
subscription timeout occurs simultaneously (the timer is
scheduled to run on another CPU) then the timer thread
might decrement the subscribers refcount before the create
thread increments the refcount.

This can be simulated by creating subscription with timeout=0 and
sometimes the timeout occurs before the create request is complete.
This leads to the following message:
[30.702949] BUG: spinlock bad magic on CPU#1, kworker/u8:3/87
[30.703834] general protection fault: 0000 [#1] SMP
[30.704826] CPU: 1 PID: 87 Comm: kworker/u8:3 Not tainted 4.4.0-rc8+ #18
[30.704826] Workqueue: tipc_rcv tipc_recv_work [tipc]
[30.704826] task: ffff88003f878600 ti: ffff88003fae0000 task.ti: ffff88003fae0000
[30.704826] RIP: 0010:[<ffffffff8109196c>]  [<ffffffff8109196c>] spin_dump+0x5c/0xe0
[...]
[30.704826] Call Trace:
[30.704826]  [<ffffffff81091a16>] spin_bug+0x26/0x30
[30.704826]  [<ffffffff81091b75>] do_raw_spin_lock+0xe5/0x120
[30.704826]  [<ffffffff81684439>] _raw_spin_lock_bh+0x19/0x20
[30.704826]  [<ffffffffa0096f10>] tipc_subscrb_rcv_cb+0x1d0/0x330 [tipc]
[30.704826]  [<ffffffffa00a37b1>] tipc_receive_from_sock+0xc1/0x150 [tipc]
[30.704826]  [<ffffffffa00a31df>] tipc_recv_work+0x3f/0x80 [tipc]
[30.704826]  [<ffffffff8106a739>] process_one_work+0x149/0x3c0
[30.704826]  [<ffffffff8106aa16>] worker_thread+0x66/0x460
[30.704826]  [<ffffffff8106a9b0>] ? process_one_work+0x3c0/0x3c0
[30.704826]  [<ffffffff8106a9b0>] ? process_one_work+0x3c0/0x3c0
[30.704826]  [<ffffffff8107029d>] kthread+0xed/0x110
[30.704826]  [<ffffffff810701b0>] ? kthread_create_on_node+0x190/0x190
[30.704826]  [<ffffffff81684bdf>] ret_from_fork+0x3f/0x70

In this commit,
1. we remove the check for the return code for mod_timer()
2. we protect tipc_subscrb_get() using the subscriber spin lock.
   We increment the subscriber's refcount as soon as we add the
   subscription to subscriber's subscription list.

Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06 03:41:58 -05:00
Parthasarathy Bhuvaragan d4091899c9 tipc: hold subscriber->lock for tipc_nametbl_subscribe()
Until now, while creating a subscription the subscriber lock
protects only the subscribers subscription list and not the
nametable. The call to tipc_nametbl_subscribe() is outside
the lock. However, at subscription timeout and cancel both
the subscribers subscription list and the nametable are
protected by the subscriber lock.

This asymmetric locking mechanism leads to the following problem:
In a SMP system, the timer can be fire on another core before
the create request is complete.
When the timer thread calls tipc_nametbl_unsubscribe() before create
thread calls tipc_nametbl_subscribe(), we get a nullptr exception.

This can be simulated by creating subscription with timeout=0 and
sometimes the timeout occurs before the create request is complete.

The following is the oops:
[57.569661] BUG: unable to handle kernel NULL pointer dereference at (null)
[57.577498] IP: [<ffffffffa02135aa>] tipc_nametbl_unsubscribe+0x8a/0x120 [tipc]
[57.584820] PGD 0
[57.586834] Oops: 0002 [#1] SMP
[57.685506] CPU: 14 PID: 10077 Comm: kworker/u40:1 Tainted: P OENX 3.12.48-52.27.1.     9688.1.PTF-default #1
[57.703637] Workqueue: tipc_rcv tipc_recv_work [tipc]
[57.708697] task: ffff88064c7f00c0 ti: ffff880629ef4000 task.ti: ffff880629ef4000
[57.716181] RIP: 0010:[<ffffffffa02135aa>]  [<ffffffffa02135aa>] tipc_nametbl_unsubscribe+0x8a/   0x120 [tipc]
[...]
[57.812327] Call Trace:
[57.814806]  [<ffffffffa0211c77>] tipc_subscrp_delete+0x37/0x90 [tipc]
[57.821357]  [<ffffffffa0211e2f>] tipc_subscrp_timeout+0x3f/0x70 [tipc]
[57.827982]  [<ffffffff810618c1>] call_timer_fn+0x31/0x100
[57.833490]  [<ffffffff81062709>] run_timer_softirq+0x1f9/0x2b0
[57.839414]  [<ffffffff8105a795>] __do_softirq+0xe5/0x230
[57.844827]  [<ffffffff81520d1c>] call_softirq+0x1c/0x30
[57.850150]  [<ffffffff81004665>] do_softirq+0x55/0x90
[57.855285]  [<ffffffff8105aa35>] irq_exit+0x95/0xa0
[57.860290]  [<ffffffff815215b5>] smp_apic_timer_interrupt+0x45/0x60
[57.866644]  [<ffffffff8152005d>] apic_timer_interrupt+0x6d/0x80
[57.872686]  [<ffffffffa02121c5>] tipc_subscrb_rcv_cb+0x2a5/0x3f0 [tipc]
[57.879425]  [<ffffffffa021c65f>] tipc_receive_from_sock+0x9f/0x100 [tipc]
[57.886324]  [<ffffffffa021c826>] tipc_recv_work+0x26/0x60 [tipc]
[57.892463]  [<ffffffff8106fb22>] process_one_work+0x172/0x420
[57.898309]  [<ffffffff8107079a>] worker_thread+0x11a/0x3c0
[57.903871]  [<ffffffff81077114>] kthread+0xb4/0xc0
[57.908751]  [<ffffffff8151f318>] ret_from_fork+0x58/0x90

In this commit, we do the following at subscription creation:
1. set the subscription's subscriber pointer before performing
   tipc_nametbl_subscribe(), as this value is required further in
   the call chain ex: by tipc_subscrp_send_event().
2. move tipc_nametbl_subscribe() under the scope of subscriber lock

Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06 03:41:58 -05:00
Parthasarathy Bhuvaragan cb01c7c870 tipc: fix connection abort when receiving invalid cancel request
Until now, the subscribers endianness for a subscription
create/cancel request is determined as:
    swap = !(s->filter & (TIPC_SUB_PORTS | TIPC_SUB_SERVICE))
The checks are performed only for port/service subscriptions.

The swap calculation is incorrect if the filter in the subscription
cancellation request is set to TIPC_SUB_CANCEL (it's a malformed
cancel request, as the corresponding subscription create filter
is missing).
Thus, the check if the request is for cancellation fails and the
request is treated as a subscription create request. The
subscription creation fails as the request is illegal, which
terminates this connection.

In this commit we determine the endianness by including
TIPC_SUB_CANCEL, which will set swap correctly and the
request is processed as a cancellation request.

Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06 03:41:58 -05:00
Parthasarathy Bhuvaragan c8beccc67c tipc: fix connection abort during subscription cancellation
In 'commit 7fe8097cef ("tipc: fix nullpointer bug when subscribing
to events")', we terminate the connection if the subscription
creation fails.
In the same commit, the subscription creation result was based on
the value of subscription pointer (set in the function) instead of
the return code.

Unfortunately, the same function also handles subscription
cancellation request. For a subscription cancellation request,
the subscription pointer cannot be set. Thus the connection is
terminated during cancellation request.

In this commit, we move the subcription cancel check outside
of tipc_subscrp_create(). Hence,
- tipc_subscrp_create() will create a subscripton
- tipc_subscrb_rcv_cb() will subscribe or cancel a subscription.

Fixes: 'commit 7fe8097cef ("tipc: fix nullpointer bug when subscribing to events")'

Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06 03:41:58 -05:00
Parthasarathy Bhuvaragan 7c13c62241 tipc: introduce tipc_subscrb_subscribe() routine
In this commit, we split tipc_subscrp_create() into two:
1. tipc_subscrp_create() creates a subscription
2. A new function tipc_subscrp_subscribe() adds the
   subscription to the subscriber subscription list,
   activates the subscription timer and subscribes to
   the nametable updates.

In future commits, the purpose of tipc_subscrb_rcv_cb() will
be to either subscribe or cancel a subscription.

There is no functional change in this commit.

Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06 03:41:57 -05:00
Parthasarathy Bhuvaragan a4273c73eb tipc: remove struct tipc_name_seq from struct tipc_subscription
Until now, struct tipc_subscriber has duplicate fields for
type, upper and lower (as member of struct tipc_name_seq) at:
1. as member seq in struct tipc_subscription
2. as member seq in struct tipc_subscr, which is contained
   in struct tipc_event
The former structure contains the type, upper and lower
values in network byte order and the later contains the
intact copy of the request.
The struct tipc_subscription contains a field swap to
determine if request needs network byte order conversion.
Thus by using swap, we can convert the request when
required instead of duplicating it.

In this commit,
1. we remove the references to these elements as members of
   struct tipc_subscription and replace them with elements
   from struct tipc_subscr.
2. provide new functions to convert the user request into
   network byte order.

Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06 03:40:43 -05:00
Parthasarathy Bhuvaragan 3086523149 tipc: remove filter and timeout elements from struct tipc_subscription
Until now, struct tipc_subscription has duplicate timeout and filter
attributes present:
1. directly as members of struct tipc_subscription
2. in struct tipc_subscr, which is contained in struct tipc_event

In this commit, we remove the references to these elements as
members of struct tipc_subscription and replace them with elements
from struct tipc_subscr.

Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06 03:40:43 -05:00
Parthasarathy Bhuvaragan 4f61d4ef70 tipc: remove incorrect check for subscription timeout value
Until now, during subscription creation we set sub->timeout by
converting the timeout request value in milliseconds to jiffies.
This is followed by setting the timeout value in the timer if
sub->timeout != TIPC_WAIT_FOREVER.

For a subscription create request with a timeout value of
TIPC_WAIT_FOREVER, msecs_to_jiffies(TIPC_WAIT_FOREVER)
returns MAX_JIFFY_OFFSET (0xfffffffe). This is not equal to
TIPC_WAIT_FOREVER (0xffffffff).

In this commit, we remove this check.

Acked-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06 03:40:43 -05:00
Richard Alpe 817298102b tipc: fix link priority propagation
Currently link priority changes isn't handled for active links. In
this patch we resolve this by changing our priority if the peer passes
a valid priority in a state message.

Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06 02:45:38 -05:00
Richard Alpe d01332f1ac tipc: fix link attribute propagation bug
Changing certain link attributes (link tolerance and link priority)
from the TIPC management tool is supposed to automatically take
effect at both endpoints of the affected link.

Currently the media address is not instantiated for the link and is
used uninstantiated when crafting protocol messages designated for the
peer endpoint. This means that changing a link property currently
results in the property being changed on the local machine but the
protocol message designated for the peer gets lost. Resulting in
property discrepancy between the endpoints.

In this patch we resolve this by using the media address from the
link entry and using the bearer transmit function to send it. Hence,
we can now eliminate the redundant function tipc_link_prot_xmit() and
the redundant field tipc_link::media_addr.

Fixes: 2af5ae372a (tipc: clean up unused code and structures)
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Reported-by: Jason Hu <huzhijiang@gmail.com>
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06 02:45:27 -05:00
Parthasarathy Bhuvaragan 4d5cfcba2f tipc: fix connection abort during subscription cancel
In 'commit 7fe8097cef ("tipc: fix nullpointer bug when subscribing
to events")', we terminate the connection if the subscription
creation fails.
In the same commit, the subscription creation result was based on
the value of the subscription pointer (set in the function) instead
of the return code.

Unfortunately, the same function tipc_subscrp_create() handles
subscription cancel request. For a subscription cancellation request,
the subscription pointer cannot be set. Thus if a subscriber has
several subscriptions and cancels any of them, the connection is
terminated.

In this commit, we terminate the connection based on the return value
of tipc_subscrp_create().
Fixes: commit 7fe8097cef ("tipc: fix nullpointer bug when subscribing to events")

Reviewed-by:  Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-29 15:14:21 -05:00
Pravin B Shelar 039f50629b ip_tunnel: Move stats update to iptunnel_xmit()
By moving stats update into iptunnel_xmit(), we can simplify
iptunnel_xmit() usage. With this change there is no need to
call another function (iptunnel_xmit_stats()) to update stats
in tunnel xmit code path.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-25 23:32:23 -05:00
David S. Miller f188b951f3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/renesas/ravb_main.c
	kernel/bpf/syscall.c
	net/ipv4/ipmr.c

All three conflicts were cases of overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03 21:09:12 -05:00
Jon Paul Maloy dc8d1eb305 tipc: fix node reference count bug
Commit 5405ff6e15 ("tipc: convert node lock to rwlock")
introduced a bug to the node reference counter handling. When a
message is successfully sent in the function tipc_node_xmit(),
we return directly after releasing the node lock, instead of
continuing and decrementing the node reference counter as we
should do.

This commit fixes this bug.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03 15:19:40 -05:00
Herbert Xu 1ce0bf50ae net: Generalise wq_has_sleeper helper
The memory barrier in the helper wq_has_sleeper is needed by just
about every user of waitqueue_active.  This patch generalises it
by making it take a wait_queue_head_t directly.  The existing
helper is renamed to skwq_has_sleeper.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-30 14:47:33 -05:00
Ying Xue 7098356bac tipc: fix error handling of expanding buffer headroom
Coverity says:

*** CID 1338065:  Error handling issues  (CHECKED_RETURN)
/net/tipc/udp_media.c: 162 in tipc_udp_send_msg()
156     	struct udp_media_addr *dst = (struct udp_media_addr *)&dest->value;
157     	struct udp_media_addr *src = (struct udp_media_addr *)&b->addr.value;
158     	struct sk_buff *clone;
159     	struct rtable *rt;
160
161     	if (skb_headroom(skb) < UDP_MIN_HEADROOM)
>>>     CID 1338065:  Error handling issues  (CHECKED_RETURN)
>>>     Calling "pskb_expand_head" without checking return value (as is done elsewhere 51 out of 56 times).
162     		pskb_expand_head(skb, UDP_MIN_HEADROOM, 0, GFP_ATOMIC);
163
164     	clone = skb_clone(skb, GFP_ATOMIC);
165     	skb_set_inner_protocol(clone, htons(ETH_P_TIPC));
166     	ub = rcu_dereference_rtnl(b->media_ptr);
167     	if (!ub) {

When expanding buffer headroom over udp tunnel with pskb_expand_head(),
it's unfortunate that we don't check its return value. As a result, if
the function returns an error code due to the lack of memory, it may
cause unpredictable consequence as we unconditionally consider that
it's always successful.

Fixes: e53567948f ("tipc: conditionally expand buffer headroom over udp tunnel")
Reported-by: <scan-admin@coverity.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-24 11:26:19 -05:00
Ying Xue f4195d1eac tipc: avoid packets leaking on socket receive queue
Even if we drain receive queue thoroughly in tipc_release() after tipc
socket is removed from rhashtable, it is possible that some packets
are in flight because some CPU runs receiver and did rhashtable lookup
before we removed socket. They will achieve receive queue, but nobody
delete them at all. To avoid this leak, we register a private socket
destructor to purge receive queue, meaning releasing packets pending
on receive queue will be delayed until the last reference of tipc
socket will be released.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-23 23:45:15 -05:00
Jon Paul Maloy 9a65083827 tipc: correct settings of broadcast link state
Since commit 5266698661 ("tipc: let broadcast packet
reception use new link receive function") the broadcast send
link state was meant to always be set to LINK_ESTABLISHED, since
we don't need this link to follow the regular link FSM rules. It
was also the intention that this state anyway shouldn't impact
the run-time working state of the link, since the latter in
reality is controlled by the number of registered peers.

We have now discovered that this assumption is not quite correct.
If the broadcast link is reset because of too many retransmissions,
its state will inadvertently go to LINK_RESETTING, and never go
back to LINK_ESTABLISHED, because the LINK_FAILURE event was not
anticipated. This will work well once, but if it happens a second
time, the reset on a link in LINK_RESETTING has has no effect, and
neither the broadcast link nor the unicast links will go down as
they should.

Furthermore, it is confusing that the management tool shows that
this link is in UP state when that obviously isn't the case.

We now ensure that this state strictly follows the true working
state of the link. The state is set to LINK_ESTABLISHED when
the number of peers is non-zero, and to LINK_RESET otherwise.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-20 14:08:51 -05:00
Jon Paul Maloy 1a90632da8 tipc: eliminate remnants of hungarian notation
The number of variables with Hungarian notation (l_ptr, n_ptr etc.)
has been significantly reduced over the last couple of years.

We now root out the last traces of this practice.
There are no functional changes in this commit.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-20 14:06:10 -05:00
Jon Paul Maloy 38206d5939 tipc: narrow down interface towards struct tipc_link
We move the definition of struct tipc_link from link.h to link.c in
order to minimize its exposure to the rest of the code.

When needed, we define new functions to make it possible for external
entities to access and set data in the link.

Apart from the above, there are no functional changes.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-20 14:06:10 -05:00
Jon Paul Maloy 5be9c08671 tipc: narrow down exposure of struct tipc_node
In our effort to have less code and include dependencies between
entities such as node, link and bearer, we try to narrow down
the exposed interface towards the node as much as possible.

In this commit, we move the definition of struct tipc_node, along
with many of its associated function declarations, from node.h to
node.c. We also move some function definitions from link.c and
name_distr.c to node.c, since they access fields in struct tipc_node
that should not be externally visible. The moved functions are renamed
according to new location, and made static whenever possible.

There are no functional changes in this commit.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-20 14:06:10 -05:00
Jon Paul Maloy 5405ff6e15 tipc: convert node lock to rwlock
According to the node FSM a node in state SELF_UP_PEER_UP cannot
change state inside a lock context, except when a TUNNEL_PROTOCOL
(SYNCH or FAILOVER) packet arrives. However, the node's individual
links may still change state.

Since each link now is protected by its own spinlock, we finally have
the conditions in place to convert the node spinlock to an rwlock_t.
If the node state and arriving packet type are rigth, we can let the
link directly receive the packet under protection of its own spinlock
and the node lock in read mode. In all other cases we use the node
lock in write mode. This enables full concurrent execution between
parallel links during steady-state traffic situations, i.e., 99+ %
of the time.

This commit implements this change.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-20 14:06:10 -05:00
Jon Paul Maloy 2312bf61ae tipc: introduce per-link spinlock
As a preparation to allow parallel links to work more independently
from each other we introduce a per-link spinlock, to be stored in the
struct nodes's link entry area. Since the node lock still is a regular
spinlock there is no increase in parallellism at this stage.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-20 14:06:10 -05:00
Jon Paul Maloy 1d7e1c2595 tipc: reduce code dependency between binding table and node layer
The file name_distr.c currently contains three functions,
named_cluster_distribute(), tipc_publ_subcscribe() and
tipc_publ_unsubscribe() that all directly access fields in
struct tipc_node. We want to eliminate such dependencies, so
we move those functions to the file node.c and rename them to
tipc_node_broadcast(), tipc_node_subscribe() and tipc_node_unsubscribe()
respectively.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-20 14:06:10 -05:00
Jon Paul Maloy 5c10e97940 tipc: small cleanup of function tipc_node_check_state()
The function tipc_node_check_state() contains the core logics
for handling link synchronization and failover. For this reason,
it is important to keep it as comprehensible as possible.

In this commit, we make three small cleanups.

1) If the node is in state SELF_DOWN_PEER_LEAVING and the received
   packet confirms that the peer has lost contact, there will be no
   further action in this function. To make this clearer, we return
   from the function directly after the state change.

2) Since commit 0f8b8e28fb ("tipc: eliminate risk of stalled
   link synchronization") only the logically first TUNNEL_PROTO/SYNCH
   packet can alter the link state and set the synch point,
   independently of arrival order. Hence, there is not any longer any
   need to adjust the synch value in case such packets arrive in
   disorder. We remove this adjustment.

3) It is the intention that any message arriving on any of the links
   may trig a check for and possible termination of a node SYNCH state.
   A redundant and unnoticed check for tipc_link_is_synching() obviously
   beats this purpose, with the effect that only packets arriving on the
   synching link may currently end the synch state. We remove this check.
   This change will further shorten the synchronization period between
   parallel links.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-20 14:06:10 -05:00
Jon Paul Maloy c7cad0d6f7 tipc: move linearization of buffers to generic code
In commit 5cbb28a4bf ("tipc: linearize arriving NAME_DISTR
and LINK_PROTO buffers") we added linearization of NAME_DISTRIBUTOR,
LINK_PROTOCOL/RESET and LINK_PROTOCOL/ACTIVATE to the function
tipc_udp_recv(). The location of the change was selected in order
to make the commit easily appliable to 'net' and 'stable'.

We now move this linearization to where it should be done, in the
functions tipc_named_rcv() and tipc_link_proto_rcv() respectively.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-20 14:06:09 -05:00
David S. Miller 73186df8d7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor overlapping changes in net/ipv4/ipmr.c, in 'net' we were
fixing the "BH-ness" of the counter bumps whilst in 'net-next'
the functions were modified to take an explicit 'net' parameter.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-03 13:41:45 -05:00
Jon Paul Maloy 5cbb28a4bf tipc: linearize arriving NAME_DISTR and LINK_PROTO buffers
Testing of the new UDP bearer has revealed that reception of
NAME_DISTRIBUTOR, LINK_PROTOCOL/RESET and LINK_PROTOCOL/ACTIVATE
message buffers is not prepared for the case that those may be
non-linear.

We now linearize all such buffers before they are delivered up to the
generic reception layer.

In order for the commit to apply cleanly to 'net' and 'stable', we do
the change in the function tipc_udp_recv() for now. Later, we will post
a commit to 'net-next' moving the linearization to generic code, in
tipc_named_rcv() and tipc_link_proto_rcv().

Fixes: commit d0f91938be ("tipc: add ip/udp media type")
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-01 12:04:29 -05:00
Wu Fengguang 742e038330 tipc: link_is_bc_sndlink() can be static
TO: "David S. Miller" <davem@davemloft.net>
CC: netdev@vger.kernel.org
CC: Jon Maloy <jon.maloy@ericsson.com>
CC: Ying Xue <ying.xue@windriver.com>
CC: tipc-discussion@lists.sourceforge.net
CC: linux-kernel@vger.kernel.org

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-25 06:31:52 -07:00
Jon Paul Maloy 2af5ae372a tipc: clean up unused code and structures
After the previous changes in this series, we can now remove some
unused code and structures, both in the broadcast, link aggregation
and link code.

There are no functional changes in this commit.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:47 -07:00
Jon Paul Maloy c49a0a8439 tipc: ensure binding table initial distribution is sent via first link
Correct synchronization of the broadcast link at first contact between
two nodes is dependent on the assumption that the binding table "bulk"
update passes via the same link as the initial broadcast syncronization
message, i.e., via the first link that is established.

This is not guaranteed in the current implementation. If two link
come up very close to each other in time, the "bulk" may quite well
pass via the second link, and hence void the guarantee of a correct
initial synchronization before the broadcast link is opened.

This commit makes two small changes to strengthen this guarantee.

1) We let the second established link occupy slot 1 of the
   "active_links" array, while the first link will retain slot 0.
   (This is in reality a cosmetic change, we could just as well keep
    the current, opposite order)

2) We let the name distributor always use link selector/slot 0 when
   it sends it binding table updates.

The extra traffic bias on the first link caused by this change should
be negligible, since binding table updates constitutes a very small
fraction of the total traffic.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:46 -07:00
Jon Paul Maloy c72fa872a2 tipc: eliminate link's reference to owner node
With the recent commit series, we have established a one-way dependency
between the link aggregation (struct tipc_node) instances and their
pertaining tipc_link instances. This has enabled quite significant code
and structure simplifications.

In this commit, we eliminate the field 'owner', which points to an
instance of struct tipc_node, from struct tipc_link, and replace it with
a pointer to struct net, which is the only external reference now needed
by a link instance.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:45 -07:00
Jon Paul Maloy 7214bcf875 tipc: eliminate redundant buffer cloning at transmission
Since all packet transmitters (link, bcast, discovery) are now sending
consumable buffer clones to the bearer layer, we can remove the
redundant buffer cloning that is perfomed in the lower level functions
tipc_l2_send_msg() and tipc_udp_send_msg().

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:44 -07:00
Jon Paul Maloy 60852d6795 tipc: let neighbor discoverer tranmsit consumable buffers
The neighbor discovery function currently uses the function
tipc_bearer_send() for transmitting packets, assuming that the
sent buffers are not consumed by the called function.

We want to change this, in order to avoid unnecessary buffer cloning
elswhere in the code.

This commit introduces a new function tipc_bearer_skb() which consumes
the sent buffers, and let the discoverer functions use this new call
instead. The discoverer does now itself perform the cloning when
that is necessary.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:44 -07:00
Jon Paul Maloy 959e1781aa tipc: introduce jumbo frame support for broadcast
Until now, we have only been supporting a fix MTU size of 1500 bytes
for all broadcast media, irrespective of their actual capability.

We now make the broadcast MTU adaptable to the carrying media, i.e.,
we use the smallest MTU supported by any of the interfaces attached
to TIPC.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:40 -07:00
Jon Paul Maloy b06b281e79 tipc: simplify bearer level broadcast
Until now, we have been keeping track of the exact set of broadcast
destinations though the help structure tipc_node_map. This leads us to
have to maintain a whole infrastructure for supporting this, including
a pseudo-bearer and a number of functions to manipulate both the bearers
and the node map correctly. Apart from the complexity, this approach is
also limiting, as struct tipc_node_map only can support cluster local
broadcast if we want to avoid it becoming excessively large. We want to
eliminate this limitation, in order to enable introduction of scoped
multicast in the future.

A closer analysis reveals that it is unnecessary maintaining this "full
set" overview; it is sufficient to keep a counter per bearer, indicating
how many nodes can be reached via this bearer at the moment. The protocol
is now robust enough to handle transitional discrepancies between the
nominal number of reachable destinations, as expected by the broadcast
protocol itself, and the number which is actually reachable at the
moment. The initial broadcast synchronization, in conjunction with the
retransmission mechanism, ensures that all packets will eventually be
acknowledged by the correct set of destinations.

This commit introduces these changes.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:39 -07:00
Jon Paul Maloy 5266698661 tipc: let broadcast packet reception use new link receive function
The code path for receiving broadcast packets is currently distinct
from the unicast path. This leads to unnecessary code and data
duplication, something that can be avoided with some effort.

We now introduce separate per-peer tipc_link instances for handling
broadcast packet reception. Each receive link keeps a pointer to the
common, single, broadcast link instance, and can hence handle release
and retransmission of send buffers as if they belonged to the own
instance.

Furthermore, we let each unicast link instance keep a reference to both
the pertaining broadcast receive link, and to the common send link.
This makes it possible for the unicast links to easily access data for
broadcast link synchronization, as well as for carrying acknowledges for
received broadcast packets.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:37 -07:00
Jon Paul Maloy fd556f209a tipc: introduce capability bit for broadcast synchronization
Until now, we have tried to support both the newer, dedicated broadcast
synchronization mechanism along with the older, less safe, RESET_MSG/
ACTIVATE_MSG based one. The latter method has turned out to be a hazard
in a highly dynamic cluster, so we find it safer to disable it completely
when we find that the former mechanism is supported by the peer node.

For this purpose, we now introduce a new capabability bit,
TIPC_BCAST_SYNCH, to inform any peer nodes that dedicated broadcast
syncronization is supported by the present node. The new bit is conveyed
between peers in the 'capabilities' field of neighbor discovery messages.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:35 -07:00
Jon Paul Maloy 2f56612457 tipc: let broadcast transmission use new link transmit function
This commit simplifies the broadcast link transmission function, by
leveraging previous changes to the link transmission function and the
broadcast transmission link life cycle.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:32 -07:00
Jon Paul Maloy c1ab3f1dea tipc: make struct tipc_link generic to support broadcast
Realizing that unicast is just a special case of broadcast, we also see
that we can go in the other direction, i.e., that modest changes to the
current unicast link can make it generic enough to support broadcast.

The following changes are introduced here:

- A new counter ("ackers") in struct tipc_link, to indicate how many
  peers need to ack a packet before it can be released.
- A corresponding counter in the skb user area, to keep track of how
  many peers a are left to ack before a buffer can be released.
- A new counter ("acked"), to keep persistent track of how far a peer
  has acked at the moment, i.e., where in the transmission queue to
  start updating buffers when the next ack arrives. This is to avoid
  double acknowledgements from a peer, with inadvertent relase of
  packets as a result.
- A more generic tipc_link_retrans() function, where retransmit starts
  from a given sequence number, instead of the first packet in the
  transmision queue. This is to minimize the number of retransmitted
  packets on the broadcast media.

When the new functionality is taken into use in the next commits,
we expect it to have minimal effect on unicast mode performance.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:32 -07:00
Jon Paul Maloy 323019069e tipc: use explicit allocation of broadcast send link
The broadcast link instance (struct tipc_link) used for sending is
currently aggregated into struct tipc_bclink. This means that we cannot
use the regular tipc_link_create() function for initiating the link, but
do instead have to initiate numerous fields directly from the
bcast_init() function.

We want to reduce dependencies between the broadcast functionality
and the inner workings of tipc_link. In this commit, we introduce
a new function tipc_bclink_create() to link.c, and allocate the
instance of the link separately using this function.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:30 -07:00
Jon Paul Maloy 0e05498e9e tipc: make link implementation independent from struct tipc_bearer
In reality, the link implementation is already independent from
struct tipc_bearer, in that it doesn't store any reference to it.
However, we still pass on a pointer to a bearer instance in the
function tipc_link_create(), just to have it extract some
initialization information from it.

I later commits, we need to create instances of tipc_link without
having any associated struct tipc_bearer. To facilitate this, we
want to extract the initialization data already in the creator
function in node.c, before calling tipc_link_create(), and pass
this info on as individual parameters in the call.

This commit introduces this change.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:30 -07:00
Jon Paul Maloy 5fd9fd6351 tipc: create broadcast transmission link at namespace init
The broadcast transmission link is currently instantiated when the
network subsystem is started, i.e., on order from user space via netlink.

This forces the broadcast transmission code to do unnecessary tests for
the existence of the transmission link, as well in single mode node as
in network mode.

In this commit, we do instead create the link during initialization of
the name space, and remove it when it is stopped. The fact that the
transmission link now has a guaranteed longer life cycle than any of its
potential clients paves the way for further code simplifcations
and optimizations.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:27 -07:00
Jon Paul Maloy 0043550b0a tipc: move broadcast link lock to struct tipc_net
The broadcast lock will need to be acquired outside bcast.c in a later
commit. For this reason, we move the lock to struct tipc_net. Consistent
with the changes in the previous commit, we also introducee two new
functions tipc_bcast_lock() and tipc_bcast_unlock(). The code that is
currently using tipc_bclink_lock()/unlock() will be phased out during
the coming commits in this series.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:25 -07:00
Jon Paul Maloy 6beb19a62a tipc: move bcast definitions to bcast.c
Currently, a number of structure and function definitions related
to the broadcast functionality are unnecessarily exposed in the file
bcast.h. This obscures the fact that the external interface towards
the broadcast link in fact is very narrow, and causes unnecessary
recompilations of other files when anything changes in those
definitions.

In this commit, we move as many of those definitions as is currently
possible to the file bcast.c.

We also rename the structure 'tipc_bclink' to 'tipc_bc_base', both
since the name does not correctly describe the contents of this
struct, and will do so even less in the future, and because we want
to use the term 'link' more appropriately in the functionality
introduced later in this series.

Finally, we rename a couple of functions, such as tipc_bclink_xmit()
and others that will be kept in the future, to include the term 'bcast'
instead.

There are no functional changes in this commit.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:56:24 -07:00
David S. Miller ba3e2084f2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/ipv6/xfrm6_output.c
	net/openvswitch/flow_netlink.c
	net/openvswitch/vport-gre.c
	net/openvswitch/vport-vxlan.c
	net/openvswitch/vport.c
	net/openvswitch/vport.h

The openvswitch conflicts were overlapping changes.  One was
the egress tunnel info fix in 'net' and the other was the
vport ->send() op simplification in 'net-next'.

The xfrm6_output.c conflicts was also a simplification
overlapping a bug fix.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:54:12 -07:00
Jon Paul Maloy e53567948f tipc: conditionally expand buffer headroom over udp tunnel
In commit d999297c3d ("tipc: reduce locking scope during packet reception")
we altered the packet retransmission function. Since then, when
restransmitting packets, we create a clone of the original buffer
using __pskb_copy(skb, MIN_H_SIZE), where MIN_H_SIZE is the size of
the area we want to have copied, but also the smallest possible TIPC
packet size. The value of MIN_H_SIZE is 24.

Unfortunately, __pskb_copy() also has the effect that the headroom
of the cloned buffer takes the size MIN_H_SIZE. This is too small
for carrying the packet over the UDP tunnel bearer, which requires
a minimum headroom of 28 bytes. A change to just use pskb_copy()
lets the clone inherit the original headroom of 80 bytes, but also
assumes that the copied data area is of at least that size, something
that is not always the case. So that is not a viable solution.

We now fix this by adding a check for sufficient headroom in the
transmit function of udp_media.c, and expanding it when necessary.

Fixes: commit d999297c3d ("tipc: reduce locking scope during packet reception")
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:13:48 -07:00
Jon Paul Maloy 45c8b7b175 tipc: allow non-linear first fragment buffer
The current code for message reassembly is erroneously assuming that
the the first arriving fragment buffer always is linear, and then goes
ahead resetting the fragment list of that buffer in anticipation of
more arriving fragments.

However, if the buffer already happens to be non-linear, we will
inadvertently drop the already attached fragment list, and later
on trig a BUG() in __pskb_pull_tail().

We see this happen when running fragmented TIPC multicast across UDP,
something made possible since
commit d0f91938be ("tipc: add ip/udp media type")

We fix this by not resetting the fragment list when the buffer is non-
linear, and by initiatlizing our private fragment list tail pointer to
the tail of the existing fragment list.

Fixes: commit d0f91938be ("tipc: add ip/udp media type")
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:08:24 -07:00
Jon Paul Maloy 53387c4e22 tipc: extend broadcast link window size
The default fix broadcast window size is currently set to 20 packets.
This is a very low value, set at a time when we were still testing on
10 Mb/s hubs, and a change to it is long overdue.

Commit 7845989cb4 ("net: tipc: fix stall during bclink wakeup procedure")
revealed a problem with this low value. For messages of importance LOW,
the backlog queue limit will be  calculated to 30 packets, while a
single, maximum sized message of 66000 bytes, carried across a 1500 MTU
network consists of 46 packets.

This leads to the following scenario (among others leading to the same
situation):

1: Msg 1 of 46 packets is sent. 20 packets go to the transmit queue, 26
   packets to the backlog queue.
2: Msg 2 of 46 packets is attempted sent, but rejected because there is
   no more space in the backlog queue at this level. The sender is added
   to the wakeup queue with a "pending packets chain size" number of 46.
3: Some packets in the transmit queue are acked and released. We try to
   wake up the sender, but the pending size of 46 is bigger than the LOW
   wakeup limit of 30, so this doesn't happen.
5: Subsequent acks releases all the remaining buffers. Each time we test
   for the wakeup criteria and find that 46 still is larger than 30,
   even after both the transmit and the backlog queues are empty.
6: The sender is never woken up and given a chance to send its message.
   He is stuck.

We could now loosen the wakeup criteria (used by link_prepare_wakeup())
to become equal to the send criteria (used by tipc_link_xmit()), i.e.,
by ignoring the "pending packets chain size" value altogether, or we can
just increase the queue limits so that the criteria can be satisfied
anyway. There are good reasons (potentially multiple waiting senders) to
not opt for the former solution, so we choose the latter one.

This commit fixes the problem by giving the broadcast link window a
default value of 50 packets. We also introduce a new minimum link
window size BCLINK_MIN_WIN of 32, which is enough to always avoid the
described situation. Finally, in order to not break any existing users
which may set the window explicitly, we enforce that the window is set
to the new minimum value in case the user is trying to set it to
anything lower.

Fixes: 7845989cb4 ("net: tipc: fix stall during bclink wakeup procedure")
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-21 19:02:17 -07:00
David S. Miller 26440c835f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/usb/asix_common.c
	net/ipv4/inet_connection_sock.c
	net/switchdev/switchdev.c

In the inet_connection_sock.c case the request socket hashing scheme
is completely different in net-next.

The other two conflicts were overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-20 06:08:27 -07:00
Jon Paul Maloy c819930090 tipc: update node FSM when peer RESET message is received
The change made in the previous commit revealed a small flaw in the way
the node FSM is updated. When the function tipc_node_link_down() is
called for the last link to a node, we should check whether this was
caused by a local reset or by a received RESET message from the peer.
In the latter case, we can directly issue a PEER_LOST_CONTACT_EVT to
the node FSM, so that it is ready to re-establish contact. If this is
not done, the peer node will sometimes have to go through a second
establish cycle before the link becomes stable.

We fix this in this commit by conditionally issuing the mentioned
event in the function tipc_node_link_down(). We also move LINK_RESET
FSM even away from the link_reset() function and into the caller
function, partially because it is easier to follow the code when state
changes are gathered at a limited number of locations, partially
because there will be cases in future commits where we don't want the
link to go RESET mode when link_reset() is called.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:55:23 -07:00
Jon Paul Maloy 282b3a0562 tipc: send out RESET immediately when link goes down
When a link is taken down because of a node local event, such as
disabling of a bearer or an interface, we currently leave it to the
peer node to discover the broken communication. The default time for
such failure discovery is 1.5-2 seconds.

If we instead allow the terminating link endpoint to send out a RESET
message at the moment it is reset, we can achieve the impression that
both endpoints are going down instantly. Since this is a very common
scenario, we find it worthwhile to make this small modification.

Apart from letting the link produce the said message, we also have to
ensure that the interface is able to transmit it before TIPC is
detached. We do this by performing the disabling of a bearer in three
steps:

1) Disable reception of TIPC packets from the interface in question.
2) Take down the links, while allowing them so send out a RESET message.
3) Disable transmission of TIPC packets on the interface.

Apart from this, we now have to react on the NETDEV_GOING_DOWN event,
instead of as currently the NEDEV_DOWN event, to ensure that such
transmission is possible during the teardown phase.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:55:22 -07:00
Jon Paul Maloy 73f646cec3 tipc: delay ESTABLISH state event when link is established
Link establishing, just like link teardown, is a non-atomic action, in
the sense that discovering that conditions are right to establish a link,
and the actual adding of the link to one of the node's send slots is done
in two different lock contexts. The link FSM is designed to help bridging
the gap between the two contexts in a safe manner.

We have now discovered a weakness in the implementaton of this FSM.
Because we directly let the link go from state LINK_ESTABLISHING to
state LINK_ESTABLISHED already in the first lock context, we are unable
to distinguish between a fully established link, i.e., a link that has
been added to its slot, and a link that has not yet reached the second
lock context. It may hence happen that a manual intervention, e.g., when
disabling an interface, causes the function tipc_node_link_down() to try
removing the link from the node slots, decrementing its active link
counter etc, although the link was never added there in the first place.

We solve this by delaying the actual state change until we reach the
second lock context, inside the function tipc_node_link_up(). This
makes it possible for potentail callers of __tipc_node_link_down() to
know if they should proceed or not, and the problem is solved.

Unforunately, the situation described above also has a second problem.
Since there by necessity is a tipc_node_link_up() call pending once
the node lock has been released, we must defuse that call by setting
the link back from LINK_ESTABLISHING to LINK_RESET state. This forces
us to make a slight modification to the link FSM, which will now look
as follows.

 +------------------------------------+
 |RESET_EVT                           |
 |                                    |
 |                             +--------------+
 |           +-----------------|   SYNCHING   |-----------------+
 |           |FAILURE_EVT      +--------------+   PEER_RESET_EVT|
 |           |                  A            |                  |
 |           |                  |            |                  |
 |           |                  |            |                  |
 |           |                  |SYNCH_      |SYNCH_            |
 |           |                  |BEGIN_EVT   |END_EVT           |
 |           |                  |            |                  |
 |           V                  |            V                  V
 |    +-------------+          +--------------+          +------------+
 |    |  RESETTING  |<---------|  ESTABLISHED |--------->| PEER_RESET |
 |    +-------------+ FAILURE_ +--------------+ PEER_    +------------+
 |           |        EVT        |    A         RESET_EVT       |
 |           |                   |    |                         |
 |           |  +----------------+    |                         |
 |  RESET_EVT|  |RESET_EVT            |                         |
 |           |  |                     |                         |
 |           |  |                     |ESTABLISH_EVT            |
 |           |  |  +-------------+    |                         |
 |           |  |  | RESET_EVT   |    |                         |
 |           |  |  |             |    |                         |
 |           V  V  V             |    |                         |
 |    +-------------+          +--------------+        RESET_EVT|
 +--->|    RESET    |--------->| ESTABLISHING |<----------------+
      +-------------+ PEER_    +--------------+
       |           A  RESET_EVT       |
       |           |                  |
       |           |                  |
       |FAILOVER_  |FAILOVER_         |FAILOVER_
       |BEGIN_EVT  |END_EVT           |BEGIN_EVT
       |           |                  |
       V           |                  |
      +-------------+                 |
      | FAILINGOVER |<----------------+
      +-------------+

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:55:21 -07:00
Jon Paul Maloy 8306f99a51 tipc: disallow packet duplicates in link deferred queue
After the previous commits, we are guaranteed that no packets
of type LINK_PROTOCOL or with illegal sequence numbers will be
attempted added to the link deferred queue. This makes it possible to
make some simplifications to the sorting algorithm in the function
tipc_skb_queue_sorted().

We also alter the function so that it will drop packets if one with
the same seqeunce number is already present in the queue. This is
necessary because we have identified weird packet sequences, involving
duplicate packets, where a legitimate in-sequence packet may advance to
the head of the queue without being detected and de-queued.

Finally, we make this function outline, since it will now be called only
in exceptional cases.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:55:21 -07:00
Jon Paul Maloy 81204c492b tipc: improve sequence number checking
The sequence number of an incoming packet is currently only checked
for less than, equality to, or bigger than the next expected number,
meaning that the receive window in practice becomes one half sequence
number cycle, or U16_MAX/2. This does not make sense, and may not even
be safe if there are extreme delays in the network. Any packet sent by
the peer during the ongoing cycle must belong inside his current send
window, or should otherwise be dropped if possible.

Since a link endpoint cannot know its peer's current send window, it
has to base this sanity check on a worst-case assumption, i.e., that
the peer is using a maximum sized window of 8191 packets. Using this
assumption, we now add a check that the sequence number is not bigger
than next_expected + TIPC_MAX_LINK_WIN. We also re-order the checks
done, so that the receive window test is performed before the gap test.
This way, we are guaranteed that no packet with illegal sequence numbers
are ever added to the deferred queue.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:55:20 -07:00
Jon Paul Maloy f9aa358a81 tipc: simplify tipc_link_rcv() reception loop
Currently, all packets received in tipc_link_rcv() are unconditionally
added to the packet deferred queue, whereafter that queue is walked and
all its buffers evaluated for delivery. This is both non-optimal and
and makes the queue sorting function unnecessary complex.

This commit changes the loop so that an arrived packet is evaluated
first, and added to the deferred queue only when a sequence number gap
is discovered. A non-empty deferred queue is walked until it is empty
or until its head's sequence number doesn't fit.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:55:19 -07:00
Jon Paul Maloy 9945e8043e tipc: limit usage of temporary skb list during packet reception
During packet reception, the function tipc_link_rcv() adds its accepted
packets to a temporary buffer queue, before finally splicing this queue
into the lock protected input queue that will be delivered up to the
socket layer. The purpose is to reduce potential contention on the input
queue lock. However, since the vast majority of packets arrive in
sequence, they will anyway be added one by one to the input queue, and
the use of the temporary queue becomes a sub-optimization.

The only case where this queue makes sense is when unpacking buffers
from a bundle packet; here we want to avoid dozens of small buffers
to be added individually to the lock-protected input queue in a tight
loop.

In this commit, we remove the general usage of the temporary queue,
and keep it only for the packet unbundling case.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-15 23:55:18 -07:00
Jon Paul Maloy dde4b5ae65 tipc: move fragment importance field to new header position
In commit e3eea1eb47 ("tipc: clean up handling of message priorities")
we introduced a field in the packet header for keeping track of the
priority of fragments, since this value is not present in the specified
protocol header. Since the value so far only is used at the transmitting
end of the link, we have not yet officially defined it as part of the
protocol.

Unfortunately, the field we use for keeping this value, bits 13-15 in
in word 5, has turned out to be a poor choice; it is already used by the
broadcast protocol for carrying the 'network id' field of the sending
node. Since packet fragments also need to be transported across the
broadcast protocol, the risk of conflict is obvious, and we see this
happen when we use network identities larger than 2^13-1. This has
escaped our testing because we have so far only been using small network
id values.

We now move this field to bits 0-2 in word 9, a field that is guaranteed
to be unused by all involved protocols.

Fixes: e3eea1eb47 ("tipc: clean up handling of message priorities")
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-14 19:10:08 -07:00
Jon Paul Maloy 0f8b8e28fb tipc: eliminate risk of stalled link synchronization
In commit 6e498158a8 ("tipc: move link synch and failover to link aggregation level")
we introduced a new mechanism for performing link failover and
synchronization. We have now detected a bug in this mechanism.

During link synchronization we use the arrival of any packet on
the tunnel link to trig a check for whether it has reached the
synchronization point or not. This has turned out to be too
permissive, since it may cause an arriving non-last SYNCH packet to
end the synch state, just to see the next SYNCH packet initiate a
new synch state with a new, higher synch point. This is not fatal,
but should be avoided, because it may significantly extend the
synchronization period, while at the same time we are not allowed
to send NACKs if packets are lost. In the worst case, a low-traffic
user may see its traffic stall until a LINK_PROTOCOL state message
trigs the link to leave synchronization state.

At the same time, LINK_PROTOCOL packets which happen to have a (non-
valid) sequence number lower than the tunnel link's rcv_nxt value will
be consistently dropped, and will never be able to resolve the situation
described above.

We fix this by exempting LINK_PROTOCOL packets from the sequence number
check, as they should be. We also reduce (but don't completely
eliminate) the risk of entering multiple synchronization states by only
allowing the (logically) first SYNCH packet to initiate a synchronization
state. This works independently of actual packet arrival order.

Fixes: commit 6e498158a8 ("tipc: move link synch and failover to link aggregation level")

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-14 06:06:40 -07:00
Erik Hugne 4e3ae00100 tipc: reinitialize pointer after skb linearize
The msg pointer into header may change after skb linearization.
We must reinitialize it after calling skb_linearize to prevent
operating on a freed or invalid pointer.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reported-by: Tamás Végh <tamas.vegh@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-20 22:31:20 -07:00
Kolmakov Dmitriy 7845989cb4 net: tipc: fix stall during bclink wakeup procedure
If an attempt to wake up users of broadcast link is made when there is
no enough place in send queue than it may hang up inside the
tipc_sk_rcv() function since the loop breaks only after the wake up
queue becomes empty. This can lead to complete CPU stall with the
following message generated by RCU:

INFO: rcu_sched self-detected stall on CPU { 0}  (t=2101 jiffies
					g=54225 c=54224 q=11465)
Task dump for CPU 0:
tpch            R  running task        0 39949  39948 0x0000000a
 ffffffff818536c0 ffff88181fa037a0 ffffffff8106a4be 0000000000000000
 ffffffff818536c0 ffff88181fa037c0 ffffffff8106d8a8 ffff88181fa03800
 0000000000000001 ffff88181fa037f0 ffffffff81094a50 ffff88181fa15680
Call Trace:
 <IRQ>  [<ffffffff8106a4be>] sched_show_task+0xae/0x120
 [<ffffffff8106d8a8>] dump_cpu_task+0x38/0x40
 [<ffffffff81094a50>] rcu_dump_cpu_stacks+0x90/0xd0
 [<ffffffff81097c3b>] rcu_check_callbacks+0x3eb/0x6e0
 [<ffffffff8106e53f>] ? account_system_time+0x7f/0x170
 [<ffffffff81099e64>] update_process_times+0x34/0x60
 [<ffffffff810a84d1>] tick_sched_handle.isra.18+0x31/0x40
 [<ffffffff810a851c>] tick_sched_timer+0x3c/0x70
 [<ffffffff8109a43d>] __run_hrtimer.isra.34+0x3d/0xc0
 [<ffffffff8109aa95>] hrtimer_interrupt+0xc5/0x1e0
 [<ffffffff81030d52>] ? native_smp_send_reschedule+0x42/0x60
 [<ffffffff81032f04>] local_apic_timer_interrupt+0x34/0x60
 [<ffffffff810335bc>] smp_apic_timer_interrupt+0x3c/0x60
 [<ffffffff8165a3fb>] apic_timer_interrupt+0x6b/0x70
 [<ffffffff81659129>] ? _raw_spin_unlock_irqrestore+0x9/0x10
 [<ffffffff8107eb9f>] __wake_up_sync_key+0x4f/0x60
 [<ffffffffa313ddd1>] tipc_write_space+0x31/0x40 [tipc]
 [<ffffffffa313dadf>] filter_rcv+0x31f/0x520 [tipc]
 [<ffffffffa313d699>] ? tipc_sk_lookup+0xc9/0x110 [tipc]
 [<ffffffff81659259>] ? _raw_spin_lock_bh+0x19/0x30
 [<ffffffffa314122c>] tipc_sk_rcv+0x2dc/0x3e0 [tipc]
 [<ffffffffa312e7ff>] tipc_bclink_wakeup_users+0x2f/0x40 [tipc]
 [<ffffffffa313ce26>] tipc_node_unlock+0x186/0x190 [tipc]
 [<ffffffff81597c1c>] ? kfree_skb+0x2c/0x40
 [<ffffffffa313475c>] tipc_rcv+0x2ac/0x8c0 [tipc]
 [<ffffffffa312ff58>] tipc_l2_rcv_msg+0x38/0x50 [tipc]
 [<ffffffff815a76d3>] __netif_receive_skb_core+0x5a3/0x950
 [<ffffffff815a98d3>] __netif_receive_skb+0x13/0x60
 [<ffffffff815a993e>] netif_receive_skb_internal+0x1e/0x90
 [<ffffffff815aa138>] napi_gro_receive+0x78/0xa0
 [<ffffffffa07f93f4>] tg3_poll_work+0xc54/0xf40 [tg3]
 [<ffffffff81597c8c>] ? consume_skb+0x2c/0x40
 [<ffffffffa07f9721>] tg3_poll_msix+0x41/0x160 [tg3]
 [<ffffffff815ab0f2>] net_rx_action+0xe2/0x290
 [<ffffffff8104b92a>] __do_softirq+0xda/0x1f0
 [<ffffffff8104bc26>] irq_exit+0x76/0xa0
 [<ffffffff81004355>] do_IRQ+0x55/0xf0
 [<ffffffff8165a12b>] common_interrupt+0x6b/0x6b
 <EOI>

The issue occurs only when tipc_sk_rcv() is used to wake up postponed
senders:

	tipc_bclink_wakeup_users()
		// wakeupq - is a queue which consists of special
		// 		 messages with SOCK_WAKEUP type.
		tipc_sk_rcv(wakeupq)
			...
			while (skb_queue_len(inputq)) {
				filter_rcv(skb)
					// Here the type of message is checked
					// and if it is SOCK_WAKEUP then
					// it tries to wake up a sender.
					tipc_write_space(sk)
						wake_up_interruptible_sync_poll()
			}

After the sender thread is woke up it can gather control and perform
an attempt to send a message. But if there is no enough place in send
queue it will call link_schedule_user() function which puts a message
of type SOCK_WAKEUP to the wakeup queue and put the sender to sleep.
Thus the size of the queue actually is not changed and the while()
loop never exits.

The approach I proposed is to wake up only senders for which there is
enough place in send queue so the described issue can't occur.
Moreover the same approach is already used to wake up senders on
unicast links.

I have got into the issue on our product code but to reproduce the
issue I changed a benchmark test application (from
tipcutils/demos/benchmark) to perform the following scenario:
	1. Run 64 instances of test application (nodes). It can be done
	   on the one physical machine.
	2. Each application connects to all other using TIPC sockets in
	   RDM mode.
	3. When setup is done all nodes start simultaneously send
	   broadcast messages.
	4. Everything hangs up.

The issue is reproducible only when a congestion on broadcast link
occurs. For example, when there are only 8 nodes it works fine since
congestion doesn't occur. Send queue limit is 40 in my case (I use a
critical importance level) and when 64 nodes send a message at the
same moment a congestion occurs every time.

Signed-off-by: Dmitry S Kolmakov <kolmakov.dmitriy@huawei.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-08 22:50:26 -07:00
Jon Paul Maloy 2be80c2d87 tipc: fix stale link problem during synchronization
Recent changes to the link synchronization means that we can now just
drop packets arriving on the synchronizing link before the synch point
is reached. This has lead to significant simplifications to the
implementation, but also turns out to have a flip side that we need
to consider.

Under unlucky circumstances, the two endpoints may end up
repeatedly dropping each other's packets, while immediately
asking for retransmission of the same packets, just to drop
them once more. This pattern will eventually be broken when
the synch point is reached on the other link, but before that,
the endpoints may have arrived at the retransmission limit
(stale counter) that indicates that the link should be broken.
We see this happen at rare occasions.

The fix for this is to not ask for retransmissions when a link is in
state LINK_SYNCHING. The fact that the link has reached this state
means that it has already received the first SYNCH packet, and that it
knows the synch point. Hence, it doesn't need any more packets until the
other link has reached the synch point, whereafter it can go ahead and
ask for the missing packets.

However, because of the reduced traffic on the synching link that
follows this change, it may now take longer to discover that the
synch point has been reached. We compensate for this by letting all
packets, on any of the links, trig a check for synchronization
termination. This is possible because the packets themselves don't
contain any information that is needed for discovering this condition.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-23 16:14:45 -07:00
Jon Paul Maloy 5ae2f8e685 tipc: interrupt link synchronization when a link goes down
When we introduced the new link failover/synch mechanism
in commit 6e498158a8
("tipc: move link synch and failover to link aggregation level"),
we missed the case when the non-tunnel link goes down during the link
synchronization period. In this case the tunnel link will remain in
state LINK_SYNCHING, something leading to unpredictable behavior when
the failover procedure is initiated.

In this commit, we ensure that the node and remaining link goes
back to regular communication state (SELF_UP_PEER_UP/LINK_ESTABLISHED)
when one of the parallel links goes down. We also ensure that we don't
re-enter synch mode if subsequent SYNCH packets arrive on the remaining
link.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-23 16:14:45 -07:00
Jon Paul Maloy 17b2063077 tipc: eliminate risk of premature link setup during failover
When a link goes down, and there is still a working link towards its
destination node, a failover is initiated, and the failed link is not
allowed to re-establish until that procedure is finished. To ensure
this, the concerned link endpoints are set to state LINK_FAILINGOVER,
and the node endpoints to NODE_FAILINGOVER during the failover period.

However, if the link reset is due to a disabled bearer, the corres-
ponding link endpoint is deleted, and only the node endpoint knows
about the ongoing failover. Now, if the disabled bearer is re-enabled
during the failover period, the discovery mechanism may create a new
link endpoint that is ready to be established, despite that this is not
permitted. This situation may cause both the ongoing failover and any
subsequent link synchronization to fail.

In this commit, we ensure that a newly created link goes directly to
state LINK_FAILINGOVER if the corresponding node state is
NODE_FAILINGOVER. This eliminates the problem described above.

Furthermore, we tighten the criteria for which packets are allowed
to end a failover state in the function tipc_node_check_state().
By checking that the receiving link is up and running, instead of just
checking that it is not in failover mode, we eliminate the risk that
protocol packets from the re-created link may cause the failover to
be prematurely terminated.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-23 16:14:45 -07:00
Richard Alpe 8f8ff9135b tipc: don't sanity check non-existing TLV (NL compat)
A zero length payload means that no TLV (Type Length Value) data has
been passed. Prior to this patch a non-existing TLV could be sanity
checked with TLV_OK() resulting in random behavior where a user
sending an empty message occasionally got a incorrect "operation not
supported" message back.

Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-17 10:39:54 -07:00
Roopa Prabhu 343d60aada ipv6: change ipv6_stub_impl.ipv6_dst_lookup to take net argument
This patch adds net argument to ipv6_stub_impl.ipv6_dst_lookup
for use cases where sk is not available (like mpls).
sk appears to be needed to get the namespace 'net' and is optional
otherwise. This patch series changes ipv6_stub_impl.ipv6_dst_lookup
to take net argument. sk remains optional.

All callers of ipv6_stub_impl.ipv6_dst_lookup have been modified
to pass net. I have modified them to use already available
'net' in the scope of the call. I can change them to
sock_net(sk) to avoid any unintended change in behaviour if sock
namespace is different. They dont seem to be from code inspection.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-31 15:21:30 -07:00
Jon Paul Maloy 440d8963cd tipc: clean up link creation
We simplify the link creation function tipc_link_create() and the way
the link struct it is connected to the node struct. In particular, we
remove the duplicate initialization of some fields which are anyway set
in tipc_link_reset().

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:15 -07:00
Jon Paul Maloy 9073fb8be3 tipc: use temporary, non-protected skb queue for bundle reception
Currently, when we extract small messages from a message bundle, or
when many messages have accumulated in the link arrival queue, those
messages are added one by one to the lock protected link input queue.
This may increase contention with the reader of that queue, in
the function tipc_sk_rcv().

This commit introduces a temporary, unprotected input queue in
tipc_link_rcv() for such cases. Only when the arrival queue has been
emptied, and the function is ready to return, does it splice the whole
temporary queue into the real input queue.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:15 -07:00
Jon Paul Maloy 23d8335d78 tipc: remove implicit message delivery in node_unlock()
After the most recent changes, all access calls to a link which
may entail addition of messages to the link's input queue are
postpended by an explicit call to tipc_sk_rcv(), using a reference
to the correct queue.

This means that the potentially hazardous implicit delivery, using
tipc_node_unlock() in combination with a binary flag and a cached
queue pointer, now has become redundant.

This commit removes this implicit delivery mechanism both for regular
data messages and for binding table update messages.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:14 -07:00
Jon Paul Maloy 598411d70f tipc: make resetting of links non-atomic
In order to facilitate future improvements to the locking structure, we
want to make resetting and establishing of links non-atomic. I.e., the
functions tipc_node_link_up() and tipc_node_link_down() should be called
from outside the node lock context, and grab/release the node lock
themselves. This requires that we can freeze the link state from the
moment it is set to RESETTING or PEER_RESET in one lock context until
it is set to RESET or ESTABLISHING in a later context. The recently
introduced link FSM makes this possible, so we are now ready to introduce
the above change.

This commit implements this.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:14 -07:00
Jon Paul Maloy cf148816ac tipc: move received discovery data evaluation inside node.c
The node lock is currently grabbed and and released in the function
tipc_disc_rcv() in the file discover.c. As a preparation for the next
commits, we need to move this node lock handling, along with the code
area it is covering, to node.c.

This commit introduces this change.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:14 -07:00
Jon Paul Maloy 662921cd0a tipc: merge link->exec_mode and link->state into one FSM
Until now, we have been handling link failover and synchronization
by using an additional link state variable, "exec_mode". This variable
is not independent of the link FSM state, something causing a risk of
inconsistencies, apart from the fact that it clutters the code.

The conditions are now in place to define a new link FSM that covers
all existing use cases, including failover and synchronization, and
eliminate the "exec_mode" field altogether. The FSM must also support
non-atomic resetting of links, which will be introduced later.

The new link FSM is shown below, with 7 states and 8 events.
Only events leading to state change are shown as edges.

+------------------------------------+
|RESET_EVT                           |
|                                    |
|                             +--------------+
|           +-----------------|   SYNCHING   |-----------------+
|           |FAILURE_EVT      +--------------+   PEER_RESET_EVT|
|           |                  A            |                  |
|           |                  |            |                  |
|           |                  |            |                  |
|           |                  |SYNCH_      |SYNCH_            |
|           |                  |BEGIN_EVT   |END_EVT           |
|           |                  |            |                  |
|           V                  |            V                  V
|    +-------------+          +--------------+          +------------+
|    |  RESETTING  |<---------|  ESTABLISHED |--------->| PEER_RESET |
|    +-------------+ FAILURE_ +--------------+ PEER_    +------------+
|           |        EVT        |    A         RESET_EVT       |
|           |                   |    |                         |
|           |                   |    |                         |
|           |    +--------------+    |                         |
|  RESET_EVT|    |RESET_EVT          |ESTABLISH_EVT            |
|           |    |                   |                         |
|           |    |                   |                         |
|           V    V                   |                         |
|    +-------------+          +--------------+        RESET_EVT|
+--->|    RESET    |--------->| ESTABLISHING |<----------------+
     +-------------+ PEER_    +--------------+
      |           A  RESET_EVT       |
      |           |                  |
      |           |                  |
      |FAILOVER_  |FAILOVER_         |FAILOVER_
      |BEGIN_EVT  |END_EVT           |BEGIN_EVT
      |           |                  |
      V           |                  |
     +-------------+                 |
     | FAILINGOVER |<----------------+
     +-------------+

These changes are fully backwards compatible.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:14 -07:00
Jon Paul Maloy 5045f7b900 tipc: move protocol message sending away from link FSM
The implementation of the link FSM currently takes decisions about and
sends out link protocol messages. This is unnecessary, since such
actions are not the result of any link state change, and are even
decided based on non-FSM state information ("silent_intv_cnt").

We now move the sending of unicast link protocol messages to the
function tipc_link_timeout(), and the initial broadcast synchronization
message to tipc_node_link_up(). The latter is done because a link
instance should not need to know whether it is the first or second
link to a destination. Such information is now restricted to and
handled by the link aggregation layer in node.c

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:14 -07:00
Jon Paul Maloy 6e498158a8 tipc: move link synch and failover to link aggregation level
Link failover and synchronization have until now been handled by the
links themselves, forcing them to have knowledge about and to access
parallel links in order to make the two algorithms work correctly.

In this commit, we move the control part of this functionality to the
link aggregation level in node.c, which is the right location for this.
As a result, the two algorithms become easier to follow, and the link
implementation becomes simpler.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:14 -07:00
Jon Paul Maloy 66996b6c47 tipc: extend node FSM
In the next commit, we will move link synch/failover orchestration to
the link aggregation level. In order to do this, we first need to extend
the node FSM with two more states, NODE_SYNCHING and NODE_FAILINGOVER,
plus four new events to enter and leave those states.

This commit introduces this change, without yet making use of it.
The node FSM now looks as follows:

                           +-----------------------------------------+
                           |                            PEER_DOWN_EVT|
                           |                                         |
  +------------------------+----------------+                        |
  |SELF_DOWN_EVT           |                |                        |
  |                        |                |                        |
  |              +-----------+          +-----------+                |
  |              |NODE_      |          |NODE_      |                |
  |   +----------|FAILINGOVER|<---------|SYNCHING   |------------+   |
  |   |SELF_     +-----------+ FAILOVER_+-----------+    PEER_   |   |
  |   |DOWN_EVT   |         A  BEGIN_EVT A         |     DOWN_EVT|   |
  |   |           |         |            |         |             |   |
  |   |           |         |            |         |             |   |
  |   |           |FAILOVER_|FAILOVER_   |SYNCH_   |SYNCH_       |   |
  |   |           |END_EVT  |BEGIN_EVT   |BEGIN_EVT|END_EVT      |   |
  |   |           |         |            |         |             |   |
  |   |           |         |            |         |             |   |
  |   |           |        +--------------+        |             |   |
  |   |           +------->|   SELF_UP_   |<-------+             |   |
  |   |   +----------------|   PEER_UP    |------------------+   |   |
  |   |   |SELF_DOWN_EVT   +--------------+     PEER_DOWN_EVT|   |   |
  |   |   |                   A          A                   |   |   |
  |   |   |                   |          |                   |   |   |
  |   |   |        PEER_UP_EVT|          |SELF_UP_EVT        |   |   |
  |   |   |                   |          |                   |   |   |
  V   V   V                   |          |                   V   V   V
+------------+       +-----------+    +-----------+       +------------+
|SELF_DOWN_  |       |SELF_UP_   |    |PEER_UP_   |       |PEER_DOWN   |
|PEER_LEAVING|<------|PEER_COMING|    |SELF_COMING|------>|SELF_LEAVING|
+------------+ SELF_ +-----------+    +-----------+ PEER_ +------------+
       |       DOWN_EVT       A          A          DOWN_EVT     |
       |                      |          |                       |
       |                      |          |                       |
       |           SELF_UP_EVT|          |PEER_UP_EVT            |
       |                      |          |                       |
       |                      |          |                       |
       |PEER_DOWN_EVT       +--------------+        SELF_DOWN_EVT|
       +------------------->|  SELF_DOWN_  |<--------------------+
                            |  PEER_DOWN   |
                            +--------------+

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:13 -07:00
Jon Paul Maloy 655fb243b8 tipc: reverse call order for link_reset()->node_link_down()
In many cases the call order when a link is reset goes as follows:
tipc_node_xx()->tipc_link_reset()->tipc_node_link_down()

This is not the right order if we want the node to be in control,
so in this commit we change the order to:
tipc_node_xx()->tipc_node_link_down()->tipc_link_reset()

The fact that tipc_link_reset() now is called from only one
location with a well-defined state will also facilitate later
simplifications of tipc_link_reset() and the link FSM.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:13 -07:00
Jon Paul Maloy 6144a996a6 tipc: move all link_reset() calls to link aggregation level
In line with our effort to let the node level have full control over
its links, we want to move all link reset calls from link.c to node.c.
Some of the calls can be moved by simply moving the calling function,
when this is the right thing to do. For the remaining calls we use
the now established technique of returning a TIPC_LINK_DOWN_EVT
flag from tipc_link_rcv(), whereafter we perform the reset call when
the call returns.

This change serves as a preparation for the coming commits.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:13 -07:00
Jon Paul Maloy cbeb83ca68 tipc: eliminate function tipc_link_activate()
The function tipc_link_activate() is redundant, since it mostly performs
settings that have already been done in a preceding tipc_link_reset().

There are three exceptions to this:
- The actual state change to TIPC_LINK_WORKING. This should anyway be done
  in the FSM, and not in a separate function.
- Registration of the link with the bearer. This should be done by the
  node, since we don't want the link to have any knowledge about its
  specific bearer.
- Call to tipc_node_link_up() for user access registration. With the new
  role distribution between link aggregation and link level this becomes
  the wrong call order; tipc_node_link_up() should instead be called
  directly as a result of a TIPC_LINK_UP event, hence by the node itself.

This commit implements those changes.

Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-30 17:25:13 -07:00
Jon Maloy 5a4c355229 tipc: fix bug in broadcast synch message create function
In commit d999297c3d
("tipc: reduce locking scope during packet reception") we introduced
a new function tipc_build_bcast_sync_msg(), which carries initial
synchronization data between two nodes at first contact and at
re-contact. In this function, we missed to add synchronization data,
with the effect that the broadcast link endpoints will fail to
synchronize correctly at re-contact between a running and a restarted
node. All other cases work as intended.

With this commit, we fix this bug.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-29 16:48:16 -07:00
Jon Paul Maloy cda3696d3d tipc: clean up socket layer message reception
When a message is received in a socket, one of the call chains
tipc_sk_rcv()->tipc_sk_enqueue()->filter_rcv()(->tipc_sk_proto_rcv())
or
tipc_sk_backlog_rcv()->filter_rcv()(->tipc_sk_proto_rcv())
are followed. At each of these levels we may encounter situations
where the message may need to be rejected, or a new message
produced for transfer back to the sender. Despite recent
improvements, the current code for doing this is perceived
as awkward and hard to follow.

Leveraging the two previous commits in this series, we now
introduce a more uniform handling of such situations. We
let each of the functions in the chain itself produce/reverse
the message to be returned to the sender, but also perform the
actual forwarding. This simplifies the necessary logics within
each function.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26 16:31:50 -07:00
Jon Paul Maloy bcd3ffd4f6 tipc: introduce new tipc_sk_respond() function
Currently, we use the code sequence

if (msg_reverse())
   tipc_link_xmit_skb()

at numerous locations in socket.c. The preparation of arguments
for these calls, as well as the sequence itself, makes the code
unecessarily complex.

In this commit, we introduce a new function, tipc_sk_respond(),
that performs this call combination. We also replace some, but not
yet all, of these explicit call sequences with calls to the new
function. Notably, we let the function tipc_sk_proto_rcv() use
the new function to directly send out PROBE_REPLY messages,
instead of deferring this to the calling tipc_sk_rcv() function,
as we do now.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26 16:31:50 -07:00
Jon Paul Maloy 29042e19f2 tipc: let function tipc_msg_reverse() expand header when needed
The shortest TIPC message header, for cluster local CONNECTED messages,
is 24 bytes long. With this format, the fields "dest_node" and
"orig_node" are optimized away, since they in reality are redundant
in this particular case.

However, the absence of these fields leads to code inconsistencies
that are difficult to handle in some cases, especially when we need
to reverse or reject messages at the socket layer.

In this commit, we concentrate the handling of the absent fields
to one place, by letting the function tipc_msg_reverse() reallocate
the buffer and expand the header to 32 bytes when necessary. This
means that the socket code now can assume that the two previously
absent fields are present in the header when a message needs to be
rejected. This opens up for some further simplifications of the
socket code.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26 16:31:50 -07:00
Jon Paul Maloy 16040894b2 tipc: fix compatibility bug
In commit d999297c3d
("tipc: reduce locking scope during packet reception") we introduced
a new function tipc_link_proto_rcv(). This function contains a bug,
so that it sometimes by error sends out a non-zero link priority value
in created protocol messages.

The bug may lead to an extra link reset at initial link establising
with older nodes. This will never happen more than once, whereafter
the link will work as intended.

We fix this bug in this commit.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 16:23:50 -07:00
Jon Paul Maloy d999297c3d tipc: reduce locking scope during packet reception
We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:

We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.

Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/tranmsits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.

This reduces the number of cpu cycles spent inside the node spinlock,
and reduces contention on that lock.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:16 -07:00
Jon Paul Maloy 1a20cc254e tipc: introduce node contact FSM
The logics for determining when a node is permitted to establish
and maintain contact with its peer node becomes non-trivial in the
presence of multiple parallel links that may come and go independently.

A known failure scenario is that one endpoint registers both its links
to the peer lost, cleans up it binding table, and prepares for a table
update once contact is re-establihed, while the other endpoint may
see its links reset and re-established one by one, hence seeing
no need to re-synchronize the binding table. To avoid this, a node
must not allow re-establishing contact until it has confirmation that
even the peer has lost both links.

Currently, the mechanism for handling this consists of setting and
resetting two state flags from different locations in the code. This
solution is hard to understand and maintain. A closer analysis even
reveals that it is not completely safe.

In this commit we do instead introduce an FSM that keeps track of
the conditions for when the node can establish and maintain links.
It has six states and four events, and is strictly based on explicit
knowledge about the own node's and the peer node's contact states.
Only events leading to state change are shown as edges in the figure
below.

                             +--------------+
                             | SELF_UP/     |
           +---------------->| PEER_COMING  |-----------------+
    SELF_  |                 +--------------+                 |PEER_
    ESTBL_ |                        |                         |ESTBL_
    CONTACT|      SELF_LOST_CONTACT |                         |CONTACT
           |                        v                         |
           |                 +--------------+                 |
           |      PEER_      | SELF_DOWN/   |     SELF_       |
           |      LOST_   +--| PEER_LEAVING |<--+ LOST_       v
+-------------+   CONTACT |  +--------------+   | CONTACT  +-----------+
| SELF_DOWN/  |<----------+                     +----------| SELF_UP/  |
| PEER_DOWN   |<----------+                     +----------| PEER_UP   |
+-------------+   SELF_   |  +--------------+   | PEER_    +-----------+
           |      LOST_   +--| SELF_LEAVING/|<--+ LOST_       A
           |      CONTACT    | PEER_DOWN    |     CONTACT     |
           |                 +--------------+                 |
           |                         A                        |
    PEER_  |       PEER_LOST_CONTACT |                        |SELF_
    ESTBL_ |                         |                        |ESTBL_
    CONTACT|                 +--------------+                 |CONTACT
           +---------------->| PEER_UP/     |-----------------+
                             | SELF_COMING  |
                             +--------------+

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:16 -07:00
Jon Paul Maloy 8a1577c96f tipc: move link supervision timer to node level
In our effort to move control of the links to the link aggregation
layer, we move the perodic link supervision timer to struct tipc_node.
The new timer is shared between all links belonging to the node, thus
saving resources, while still kicking the FSM on both its pertaining
links at each expiration.

The current link timer and corresponding functions are removed.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:16 -07:00
Jon Paul Maloy 333ef69ed2 tipc: simplify link timer implementation
We create a second, simpler, link timer function, tipc_link_timeout().
The new function  makes use of the new FSM function introduced in the
previous commit, and just like it, takes a buffer queue as parameter.
It returns an event bit field and potentially a link protocol packet
to the caller.

The existing timer function, link_timeout(), is still needed for a
while, so we redesign it to become a wrapper around the new function.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:16 -07:00
Jon Paul Maloy 6ab30f9cbe tipc: improve link FSM implementation
The link FSM implementation is currently unnecessarily complex.
It sometimes checks for conditional state outside the FSM data
before deciding next state, and often performs actions directly
inside the FSM logics.

In this commit, we create a second, simpler FSM implementation,
that as far as possible acts only on states and events that it is
strictly defined for, and postpone any actions until it is finished
with its decisions. It also returns an event flag field and an a
buffer queue which may potentially contain a protocol message to
be sent by the caller.

Unfortunately, we cannot yet make the FSM "clean", in the sense
that its decisions are only based on FSM state and event, and that
state changes happen only here. That will have to wait until the
activate/reset logics has been cleaned up in a future commit.

We also rename the link states as follows:

WORKING_WORKING -> TIPC_LINK_WORKING
WORKING_UNKNOWN -> TIPC_LINK_PROBING
RESET_UNKNOWN   -> TIPC_LINK_RESETTING
RESET_RESET     -> TIPC_LINK_ESTABLISHING

The existing FSM function, link_state_event(), is still needed for
a while, so we redesign it to make use of the new function.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:15 -07:00
Jon Paul Maloy 426cc2b86d tipc: introduce new link protocol msg create function
As a preparation for later changes, we introduce a new function
tipc_link_build_proto_msg(). Instead of actually sending the created
protocol message, it only creates it and adds it to the head of a
skb queue provided by the caller.

Since we still need the existing function tipc_link_protocol_xmit()
for a while, we redesign it to make use of the new function.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:15 -07:00
Jon Paul Maloy d3504c3449 tipc: clean up definitions and usage of link flags
The status flag LINK_STOPPED is not needed any more, since the
mechanism for delayed deletion of links has been removed.
Likewise, LINK_STARTED and LINK_START_EVT are unnecessary,
because we can just as well start the link timer directly from
inside tipc_link_create().

We eliminate these flags in this commit.

Instead of the above flags, we now introduce three new link modes,
TIPC_LINK_OPEN, TIPC_LINK_BLOCKED and TIPC_LINK_TUNNEL. The values
indicate whether, and in the case of TIPC_LINK_TUNNEL, which, messages
the link is allowed to receive in this state. TIPC_LINK_BLOCKED also
blocks timer-driven protocol messages to be sent out, and any change
to the link FSM. Since the modes are mutually exclusive, we convert
them to state values, and rename the 'flags' field in struct tipc_link
to 'exec_mode'.

Finally, we move the #defines for link FSM states and events from link.h
into enums inside the file link.c, which is the real usage scope of
these definitions.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:15 -07:00
Jon Paul Maloy af9b028e27 tipc: make media xmit call outside node spinlock context
Currently, message sending is performed through a deep call chain,
where the node spinlock is grabbed and held during a significant
part of the transmission time. This is clearly detrimental to
overall throughput performance; it would be better if we could send
the message after the spinlock has been released.

In this commit, we do instead let the call revert on the stack after
the buffer chain has been added to the transmission queue, whereafter
clones of the buffers are transmitted to the device layer outside the
spinlock scope.

As a further step in our effort to separate the roles of the node
and link entities we also move the function tipc_link_xmit() to
node.c, and rename it to tipc_node_xmit().

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:15 -07:00
Jon Paul Maloy 22d85c7942 tipc: change sk_buffer handling in tipc_link_xmit()
When the function tipc_link_xmit() is given a buffer list for
transmission, it currently consumes the list both when transmission
is successful and when it fails, except for the special case when
it encounters link congestion.

This behavior is inconsistent, and needs to be corrected if we want
to avoid problems in later commits in this series.

In this commit, we change this to let the function consume the list
only when transmission is successful, and leave the list with the
sender in all other cases. We also modifiy the socket code so that
it adapts to this change, i.e., purges the list when a non-congestion
error code is returned.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:15 -07:00
Jon Paul Maloy 36e78a463b tipc: use bearer index when looking up active links
struct tipc_node currently holds two arrays of link pointers; one,
indexed by bearer identity, which contains all links irrespective of
current state, and one two-slot array for the currently active link
or links. The latter array contains direct pointers into the elements
of the former. This has the effect that we cannot know the bearer id of
a link when accessing it via the "active_links[]" array without actually
dereferencing the pointer, something we want to avoid in some cases.

In this commit, we do instead store the bearer identity in the
"active_links" array, and use this as an index to find the right element
in the overall link entry array. This change should be seen as a
preparation for the later commits in this series.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:14 -07:00
Jon Paul Maloy d39bbd445d tipc: move link input queue to tipc_node
At present, the link input queue and the name distributor receive
queues are fields aggregated in struct tipc_link. This is a hazard,
because a link might be deleted while a receiving socket still keeps
reference to one of the queues.

This commit fixes this bug. However, rather than adding yet another
reference counter to the critical data path, we move the two queues
to safe ground inside struct tipc_node, which is already protected, and
let the link code only handle references to the queues. This is also
in line with planned later changes in this area.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:14 -07:00
Jon Paul Maloy d3a43b907a tipc: move link creation from neighbor discoverer to node
As a step towards turning links into node internal entities, we move the
creation of links from the neighbor discovery logics to the node's link
control logics.

We also create an additional entry for the link's media address in the
newly introduced struct tipc_link_entry, since this is where it is
needed in the upcoming commits. The current copy in struct tipc_link
is kept for now, but will be removed later.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:14 -07:00
Jon Paul Maloy 9d13ec65ed tipc: introduce link entry structure to struct tipc_node
struct 'tipc_node' currently contains two arrays for link attributes,
one for the link pointers, and one for the usable link MTUs.

We now group those into a new struct 'tipc_link_entry', and intoduce
one single array consisting of such enties. Apart from being a cosmetic
improvement, this is a starting point for the strict master-slave
relation between node and link that we will introduce in the following
commits.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 20:41:14 -07:00
Stephen Smalley fdd75ea8df net/tipc: initialize security state for new connection socket
Calling connect() with an AF_TIPC socket would trigger a series
of error messages from SELinux along the lines of:
SELinux: Invalid class 0
type=AVC msg=audit(1434126658.487:34500): avc:  denied  { <unprintable> }
  for pid=292 comm="kworker/u16:5" scontext=system_u:system_r:kernel_t:s0
  tcontext=system_u:object_r:unlabeled_t:s0 tclass=<unprintable>
  permissive=0

This was due to a failure to initialize the security state of the new
connection sock by the tipc code, leaving it with junk in the security
class field and an unlabeled secid.  Add a call to security_sk_clone()
to inherit the security state from the parent socket.

Reported-by: Tim Shearer <tim.shearer@overturenetworks.com>
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Paul Moore <paul@paul-moore.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-08 16:08:23 -07:00
Jon Paul Maloy 7d967b673c tipc: purge backlog queue counters when broadcast link is reset
In commit 1f66d161ab
("tipc: introduce starvation free send algorithm")
we introduced a counter per priority level for buffers
in the link backlog queue. We also introduced a new
function tipc_link_purge_backlog(), to reset these
counters to zero when the link is reset.

Unfortunately, we missed to call this function when
the broadcast link is reset, with the result that the
values of these counters might be permanently skewed
when new nodes are attached. This may in the worst case
lead to permananent, but spurious, broadcast link
congestion, where no broadcast packets can be sent at
all.

We fix this bug with this commit.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-28 16:43:02 -07:00
David S. Miller 25c43bf13b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-06-13 23:56:52 -07:00
Erik Hugne b3be5e3e72 tipc: disconnect socket directly after probe failure
If the TIPC connection timer expires in a probing state, a
self abort message is supposed to be generated and delivered
to the local socket. This is currently broken, and the abort
message is actually sent out to the peer node with invalid
addressing information. This will cause the link to enter
a constant retransmission state and eventually reset.
We fix this by removing the self-abort message creation and
tear down connection immediately instead.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-10 22:05:20 -07:00
Ying Xue 1ea23a2117 tipc: unconditionally put sock refcnt when sock timer to be deleted is pending
As sock refcnt is taken when sock timer is started in
sk_reset_timer(), the sock refcnt should be put when sock timer
to be deleted is in pending state no matter what "probing_state"
value of tipc sock is.

Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-30 18:08:37 -07:00
Jon Paul Maloy f3903bcc00 tipc: fix bug in link protocol message create function
In commit dd3f9e70f5
("tipc: add packet sequence number at instant of transmission") we
made a change with the consequence that packets in the link backlog
queue don't contain valid sequence numbers.

However, when we create a link protocol message, we still use the
sequence number of the first packet in the backlog, if there is any,
as "next_sent" indicator in the message. This may entail unnecessary
retransissions or stale packet transmission when there is very low
traffic on the link.

This commit fixes this issue by only using the current value of
tipc_link::snd_nxt as indicator.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-26 19:43:03 -04:00
Ying Xue fa787ae062 tipc: use sock_create_kern interface to create kernel socket
After commit eeb1bd5c40 ("net: Add a struct net parameter to
sock_create_kern"), we should use sock_create_kern() to create kernel
socket as the interface doesn't reference count struct net any more.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 13:39:33 -04:00
Jon Paul Maloy dd3f9e70f5 tipc: add packet sequence number at instant of transmission
Currently, the packet sequence number is updated and added to each
packet at the moment a packet is added to the link backlog queue.
This is wasteful, since it forces the code to traverse the send
packet list packet by packet when adding them to the backlog queue.
It would be better to just splice the whole packet list into the
backlog queue when that is the right action to do.

In this commit, we do this change. Also, since the sequence numbers
cannot now be assigned to the packets at the moment they are added
the backlog queue, we do instead calculate and add them at the moment
of transmission, when the backlog queue has to be traversed anyway.
We do this in the function tipc_link_push_packet().

Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 12:24:46 -04:00
Jon Paul Maloy f21e897ecc tipc: improve link congestion algorithm
The link congestion algorithm used until now implies two problems.

- It is too generous towards lower-level messages in situations of high
  load by giving "absolute" bandwidth guarantees to the different
  priority levels. LOW traffic is guaranteed 10%, MEDIUM is guaranted
  20%, HIGH is guaranteed 30%, and CRITICAL is guaranteed 40% of the
  available bandwidth. But, in the absence of higher level traffic, the
  ratio between two distinct levels becomes unreasonable. E.g. if there
  is only LOW and MEDIUM traffic on a system, the former is guaranteed
  1/3 of the bandwidth, and the latter 2/3. This again means that if
  there is e.g. one LOW user and 10 MEDIUM users, the  former will have
  33.3% of the bandwidth, and the others will have to compete for the
  remainder, i.e. each will end up with 6.7% of the capacity.

- Packets of type MSG_BUNDLER are created at SYSTEM importance level,
  but only after the packets bundled into it have passed the congestion
  test for their own respective levels. Since bundled packets don't
  result in incrementing the level counter for their own importance,
  only occasionally for the SYSTEM level counter, they do in practice
  obtain SYSTEM level importance. Hence, the current implementation
  provides a gap in the congestion algorithm that in the worst case
  may lead to a link reset.

We now refine the congestion algorithm as follows:

- A message is accepted to the link backlog only if its own level
  counter, and all superior level counters, permit it.

- The importance of a created bundle packet is set according to its
  contents. A bundle packet created from messges at levels LOW to
  CRITICAL is given importance level CRITICAL, while a bundle created
  from a SYSTEM level message is given importance SYSTEM. In the latter
  case only subsequent SYSTEM level messages are allowed to be bundled
  into it.

This solves the first problem described above, by making the bandwidth
guarantee relative to the total number of users at all levels; only
the upper limit for each level remains absolute. In the example
described above, the single LOW user would use 1/11th of the bandwidth,
the same as each of the ten MEDIUM users, but he still has the same
guarantee against starvation as the latter ones.

The fix also solves the second problem. If the CRITICAL level is filled
up by bundle packets of that level, no lower level packets will be
accepted any more.

Suggested-by: Gergely Kiss <gergely.kiss@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 12:24:46 -04:00
Jon Paul Maloy cd4eee3c2e tipc: simplify link supervision checkpointing
We change the sequence number checkpointing that is performed
by the timer in order to discover if the peer is active. Currently,
we store a checkpoint of the next expected sequence number "rcv_nxt"
at each timer expiration, and compare it to the current expected
number at next timeout expiration. Instead, we now use the already
existing field "silent_intv_cnt" for this task. We step the counter
at each timeout expiration, and zero it at each valid received packet.
If no valid packet has been received from the peer after "abort_limit"
number of silent timer intervals, the link is declared faulty and reset.

We also remove the multiple instances of timer activation from inside
the FSM function "link_state_event()", and now do it at only one place;
at the end of the timer function itself.

Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 12:24:46 -04:00
Jon Paul Maloy a97b9d3fa9 tipc: rename fields in struct tipc_link
We rename some fields in struct tipc_link, in order to give them more
descriptive names:

next_in_no -> rcv_nxt
next_out_no-> snd_nxt
fsm_msg_cnt-> silent_intv_cnt
cont_intv  -> keepalive_intv
last_retransmitted -> last_retransm

There are no functional changes in this commit.

Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 12:24:46 -04:00
Jon Paul Maloy e4bf4f7696 tipc: simplify packet sequence number handling
Although the sequence number in the TIPC protocol is 16 bits, we have
until now stored it internally as an unsigned 32 bits integer.
We got around this by always doing explicit modulo-65535 operations
whenever we need to access a sequence number.

We now make the incoming and outgoing sequence numbers to unsigned
16-bit integers, and remove the modulo operations where applicable.

We also move the arithmetic inline functions for 16 bit integers
to core.h, and the function buf_seqno() to msg.h, so they can easily
be accessed from anywhere in the code.

Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 12:24:46 -04:00
Jon Paul Maloy a6bf70f792 tipc: simplify include dependencies
When we try to add new inline functions in the code, we sometimes
run into circular include dependencies.

The main problem is that the file core.h, which really should be at
the root of the dependency chain, instead is a leaf. I.e., core.h
includes a number of header files that themselves should be allowed
to include core.h. In reality this is unnecessary, because core.h does
not need to know the full signature of any of the structs it refers to,
only their type declaration.

In this commit, we remove all dependencies from core.h towards any
other tipc header file.

As a consequence of this change, we can now move the function
tipc_own_addr(net) from addr.c to addr.h, and make it inline.

There are no functional changes in this commit.

Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 12:24:45 -04:00
Jon Paul Maloy 75b44b018e tipc: simplify link timer handling
Prior to this commit, the link timer has been running at a "continuity
interval" of configured link tolerance/4. When a timer wakes up and
discovers that there has been no sign of life from the peer during the
previous interval, it divides its own timer interval by another factor
four, and starts sending one probe per new interval. When the configured
link tolerance time has passed without answer, i.e. after 16 unacked
probes, the link is declared faulty and reset.

This is unnecessary complex. It is sufficient to continue with the
original continuity interval, and instead reset the link after four
missed probe responses. This makes the timer handling in the link
simpler, and opens up for some planned later changes in this area.
This commit implements this change.

Reviewed-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 12:24:45 -04:00
Jon Paul Maloy b1c29f6b10 tipc: simplify resetting and disabling of bearers
Since commit 4b475e3f2f8e4e241de101c8240f1d74d0470494
("tipc: eliminate delayed link deletion at link failover") the extra
boolean parameter "shutting_down" is not any longer needed for the
functions bearer_disable() and tipc_link_delete_list().

Furhermore, the function tipc_link_reset_links(), called from
bearer_reset()  is now unnecessary. We can just as well delete
all the links, as we do in bearer_disable(), and start over with
creating new links.

This commit introduces those changes.

Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 12:24:45 -04:00
Eric W. Biederman 11aa9c28b4 net: Pass kern from net_proto_family.create to sk_alloc
In preparation for changing how struct net is refcounted
on kernel sockets pass the knowledge that we are creating
a kernel socket from sock_create_kern through to sk_alloc.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11 10:50:17 -04:00
Richard Alpe b063bc5ea7 tipc: send explicit not supported error in nl compat
The legacy netlink API treated EPERM (permission denied) as
"operation not supported".

Reported-by: Tomi Ollila <tomi.ollila@iki.fi>
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09 16:40:03 -04:00
Richard Alpe 670f4f8818 tipc: add broadcast link window set/get to nl api
Add the ability to get or set the broadcast link window through the
new netlink API. The functionality was unintentionally missing from
the new netlink API. Adding this means that we also fix the breakage
in the old API when coming through the compat layer.

Fixes: 37e2d4843f (tipc: convert legacy nl link prop set to nl compat)
Reported-by: Tomi Ollila <tomi.ollila@iki.fi>
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09 16:40:02 -04:00
Richard Alpe c3d6fb85b2 tipc: fix default link prop regression in nl compat
Default link properties can be set for media or bearer. This
functionality was missed when introducing the NL compatibility layer.

This patch implements this functionality in the compat netlink
layer. It works the same way as it did in the old API. We search for
media and bearers matching the "link name". If we find a matching
media or bearer the link tolerance, priority or window is used as
default for new links on that media or bearer.

Fixes: 37e2d4843f (tipc: convert legacy nl link prop set to nl compat)
Reported-by: Tomi Ollila <tomi.ollila@iki.fi>
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09 16:40:02 -04:00
Ying Xue 90bdfcb76f tipc: deal with return value of tipc_conn_new callback
Once tipc_conn_new() returns NULL, the connection should be shut
down immediately, otherwise, oops may happen due to the NULL pointer.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-04 15:04:01 -04:00
Ying Xue a13683f292 tipc: adjust locking policy of subscription
Currently subscriber's lock protects not only subscriber's subscription
list but also all subscriptions linked into the list. However, as all
members of subscription are never changed after they are initialized,
it's unnecessary for subscription to be protected under subscriber's
lock. If the lock is used to only protect subscriber's subscription
list, the adjustment not only makes the locking policy simpler, but
also helps to avoid a deadlock which may happen once creating a
subscription is failed.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-04 15:04:01 -04:00
Ying Xue 00bc00a938 tipc: involve reference counter for subscriber
At present subscriber's lock is used to protect the subscription list
of subscriber as well as subscriptions linked into the list. While one
or all subscriptions are deleted through iterating the list, the
subscriber's lock must be held. Meanwhile, as deletion of subscription
may happen in subscription timer's handler, the lock must be grabbed
in the function as well. When subscription's timer is terminated with
del_timer_sync() during above iteration, subscriber's lock has to be
temporarily released, otherwise, deadlock may occur. However, the
temporary release may cause the double free of a subscription as the
subscription is not disconnected from the subscription list.

Now if a reference counter is introduced to subscriber, subscription's
timer can be asynchronously stopped with del_timer(). As a result, the
issue is not only able to be fixed, but also relevant code is pretty
readable and understandable.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-04 15:04:01 -04:00
Ying Xue 1b764828ad tipc: introduce tipc_subscrb_create routine
Introducing a new function makes the purpose of tipc_subscrb_connect_cb
callback routine more clear.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-04 15:04:01 -04:00
Ying Xue 57f1d1868f tipc: rename functions defined in subscr.c
When a topology server accepts a connection request from its client,
it allocates a connection instance and a tipc_subscriber structure
object. The former is used to communicate with client, and the latter
is often treated as a subscriber which manages all subscription events
requested from a same client. When a topology server receives a request
of subscribing name services from a client through the connection, it
creates a tipc_subscription structure instance which is seen as a
subscription recording what name services are subscribed. In order to
manage all subscriptions from a same client, topology server links
them into the subscrp_list of the subscriber. So subscriber and
subscription completely represents different meanings respectively,
but function names associated with them make us so confused that we
are unable to easily tell which function is against subscriber and
which is to subscription. So we want to eliminate the confusion by
renaming them.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-04 15:04:00 -04:00
Jon Paul Maloy 0d699f28ee tipc: fix problem with parallel link synchronization mechanism
Currently, we try to accumulate arrived packets in the links's
'deferred' queue during the parallel link syncronization phase.

This entails two problems:

- With an unlucky combination of arriving packets the algorithm
  may go into a lockstep with the out-of-sequence handling function,
  where the synch mechanism is adding a packet to the deferred queue,
  while the out-of-sequence handling is retrieving it again, thus
  ending up in a loop inside the node_lock scope.

- Even if this is avoided, the link will very often send out
  unnecessary protocol messages, in the worst case leading to
  redundant retransmissions.

We fix this by just dropping arriving packets on the upcoming link
during the synchronization phase, thus relying on the retransmission
protocol to resolve the situation once the two links have arrived to
a synchronized state.

Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29 15:08:59 -04:00
Nicolas Dichtel f2f67390a4 tipc: remove wrong use of NLM_F_MULTI
NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact,
it is sent only at the end of a dump.

Libraries like libnl will wait forever for NLMSG_DONE.

Fixes: 35b9dd7607 ("tipc: add bearer get/dump to new netlink api")
Fixes: 7be57fc691 ("tipc: add link get/dump to new netlink api")
Fixes: 46f15c6794 ("tipc: add media get/dump to new netlink api")
CC: Richard Alpe <richard.alpe@ericsson.com>
CC: Jon Maloy <jon.maloy@ericsson.com>
CC: Ying Xue <ying.xue@windriver.com>
CC: tipc-discussion@lists.sourceforge.net
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-29 14:59:17 -04:00
Erik Hugne 73a3173773 tipc: fix node refcount issue
When link statistics is dumped over netlink, we iterate over
the list of peer nodes and append each links statistics to
the netlink msg. In the case where the dump is resumed after
filling up a nlmsg, the node refcnt is decremented without
having been incremented previously which may cause the node
reference to be freed. When this happens, the following
info/stacktrace will be generated, followed by a crash or
undefined behavior.
We fix this by removing the erroneous call to tipc_node_put
inside the loop that iterates over nodes.

[  384.312303] INFO: trying to register non-static key.
[  384.313110] the code is fine but needs lockdep annotation.
[  384.313290] turning off the locking correctness validator.
[  384.313290] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.0.0+ #13
[  384.313290] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  384.313290]  ffff88003c6d0290 ffff88003cc03ca8 ffffffff8170adf1 0000000000000007
[  384.313290]  ffffffff82728730 ffff88003cc03d38 ffffffff810a6a6d 00000000001d7200
[  384.313290]  ffff88003c6d0ab0 ffff88003cc03ce8 0000000000000285 0000000000000001
[  384.313290] Call Trace:
[  384.313290]  <IRQ>  [<ffffffff8170adf1>] dump_stack+0x4c/0x65
[  384.313290]  [<ffffffff810a6a6d>] __lock_acquire+0xf3d/0xf50
[  384.313290]  [<ffffffff810a7375>] lock_acquire+0xd5/0x290
[  384.313290]  [<ffffffffa0043e8c>] ? link_timeout+0x1c/0x170 [tipc]
[  384.313290]  [<ffffffffa0043e70>] ? link_state_event+0x4e0/0x4e0 [tipc]
[  384.313290]  [<ffffffff81712890>] _raw_spin_lock_bh+0x40/0x80
[  384.313290]  [<ffffffffa0043e8c>] ? link_timeout+0x1c/0x170 [tipc]
[  384.313290]  [<ffffffffa0043e8c>] link_timeout+0x1c/0x170 [tipc]
[  384.313290]  [<ffffffff810c4698>] call_timer_fn+0xb8/0x490
[  384.313290]  [<ffffffff810c45e0>] ? process_timeout+0x10/0x10
[  384.313290]  [<ffffffff810c5a2c>] run_timer_softirq+0x21c/0x420
[  384.313290]  [<ffffffffa0043e70>] ? link_state_event+0x4e0/0x4e0 [tipc]
[  384.313290]  [<ffffffff8105a954>] __do_softirq+0xf4/0x630
[  384.313290]  [<ffffffff8105afdd>] irq_exit+0x5d/0x60
[  384.313290]  [<ffffffff8103ade1>] smp_apic_timer_interrupt+0x41/0x50
[  384.313290]  [<ffffffff817144a0>] apic_timer_interrupt+0x70/0x80
[  384.313290]  <EOI>  [<ffffffff8100db10>] ? default_idle+0x20/0x210
[  384.313290]  [<ffffffff8100db0e>] ? default_idle+0x1e/0x210
[  384.313290]  [<ffffffff8100e61a>] arch_cpu_idle+0xa/0x10
[  384.313290]  [<ffffffff81099803>] cpu_startup_entry+0x2c3/0x530
[  384.313290]  [<ffffffff810d2893>] ? clockevents_register_device+0x113/0x200
[  384.313290]  [<ffffffff81038b0f>] start_secondary+0x13f/0x170

Fixes: 8a0f6ebe84 ("tipc: involve reference counter for node structure")
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-23 11:50:34 -04:00
Erik Hugne 9871b27f67 tipc: fix random link reset problem
In the function tipc_sk_rcv(), the stack variable 'err'
is only initialized to TIPC_ERR_NO_PORT for the first
iteration over the link input queue. If a chain of messages
are received from a link, failure to lookup the socket for
any but the first message will cause the message to bounce back
out on a random link.
We fix this by properly initializing err.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-23 11:50:34 -04:00
Ying Xue def81f69bf tipc: fix topology server broken issue
When a new topology server is launched in a new namespace, its
listening socket is inserted into the "init ns" namespace's socket
hash table rather than the one owned by the new namespace. Although
the socket's namespace is forcedly changed to the new namespace later,
the socket is still stored in the socket hash table of "init ns"
namespace. When a client created in the new namespace connects
its own topology server, the connection is failed as its server's
socket could not be found from its own namespace's socket table.

If __sock_create() instead of original sock_create_kern() is used
to create the server's socket through specifying an expected namesapce,
the socket will be inserted into the specified namespace's socket
table, thereby avoiding to the topology server broken issue.

Fixes: 76100a8a64 ("tipc: fix netns refcnt leak")

Reported-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-23 11:50:34 -04:00
David Miller 79b16aadea udp_tunnel: Pass UDP socket down through udp_tunnel{, 6}_xmit_skb().
That was we can make sure the output path of ipv4/ipv6 operate on
the UDP socket rather than whatever random thing happens to be in
skb->sk.

Based upon a patch by Jiri Pirko.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
2015-04-07 15:29:08 -04:00
Jon Paul Maloy ed193ece26 tipc: simplify link mtu negotiation
When a link is being established, the two endpoints advertise their
respective interface MTU in the transmitted RESET and ACTIVATE messages.
If there is any difference, the lower of the two MTUs will be selected
for use by both endpoints.

However, as a remnant of earlier attempts to introduce TIPC level
routing. there also exists an MTU discovery mechanism. If an intermediate
node has a lower MTU than the two endpoints, they will discover this
through a bisectional approach, and finally adopt this MTU for common use.

Since there is no TIPC level routing, and probably never will be,
this mechanism doesn't make any sense, and only serves to make the
link level protocol unecessarily complex.

In this commit, we eliminate the MTU discovery algorithm,and fall back
to the simple MTU advertising approach. This change is fully backwards
compatible.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 16:27:12 -04:00
Jon Paul Maloy dff29b1a88 tipc: eliminate delayed link deletion at link failover
When a bearer is disabled manually, all its links have to be reset
and deleted. However, if there is a remaining, parallel link ready
to take over a deleted link's traffic, we currently delay the delete
of the removed link until the failover procedure is finished. This
is because the remaining link needs to access state from the reset
link, such as the last received packet number, and any partially
reassembled buffer, in order to perform a successful failover.

In this commit, we do instead move the state data over to the new
link, so that it can fulfill the procedure autonomously, without
accessing any data on the old link. This means that we can now
proceed and delete all pertaining links immediately when a bearer
is disabled. This saves us from some unnecessary complexity in such
situations.

We also choose to change the confusing definitions CHANGEOVER_PROTOCOL,
ORIGINAL_MSG and DUPLICATE_MSG to the more descriptive TUNNEL_PROTOCOL,
FAILOVER_MSG and SYNCH_MSG respectively.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 16:27:12 -04:00
Jon Paul Maloy 2da7142516 tipc: drop tunneled packet duplicates at reception
In commit 8b4ed8634f
("tipc: eliminate race condition at dual link establishment")
we introduced a parallel link synchronization mechanism that
guarentees sequential delivery even for users switching from
an old to a newly established link. The new mechanism makes it
unnecessary to deliver the tunneled duplicate packets back to
the old link, as we are currently doing. It is now sufficient
to use the last tunneled packet's inner sequence number as
synchronization point between the two parallel links, whereafter
it can be dropped.

In this commit, we drop the duplicate packets arriving on the new
link, after updating the synchronization point at each new arrival.

Although it would now have been sufficient for the other endpoint
to only tunnel the last packet in its send queue, and not the
entire queue, we must still do this to maintain compatibility
with older nodes.

This commit makes it possible to get rid if some complex
interaction between the two parallel links.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 16:27:12 -04:00
David S. Miller 9f0d34bc34 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/usb/asix_common.c
	drivers/net/usb/sr9800.c
	drivers/net/usb/usbnet.c
	include/linux/usb/usbnet.h
	net/ipv4/tcp_ipv4.c
	net/ipv6/tcp_ipv6.c

The TCP conflicts were overlapping changes.  In 'net' we added a
READ_ONCE() to the socket cached RX route read, whilst in 'net-next'
Eric Dumazet touched the surrounding code dealing with how mini
sockets are handled.

With USB, it's a case of the same bug fix first going into net-next
and then I cherry picked it back into net.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 16:16:53 -04:00
Ying Xue 7e43690578 tipc: fix a slab object leak
When remove TIPC module, there is a warning to remind us that a slab
object is leaked like:

root@localhost:~# rmmod tipc
[   19.056226] =============================================================================
[   19.057549] BUG TIPC (Not tainted): Objects remaining in TIPC on kmem_cache_close()
[   19.058736] -----------------------------------------------------------------------------
[   19.058736]
[   19.060287] INFO: Slab 0xffffea0000519a00 objects=23 used=1 fp=0xffff880014668b00 flags=0x100000000004080
[   19.061915] INFO: Object 0xffff880014668000 @offset=0
[   19.062717] kmem_cache_destroy TIPC: Slab cache still has objects

This is because the listening socket of TIPC topology server is not
closed before TIPC proto handler is unregistered with proto_unregister().
However, as the socket is closed in tipc_exit_net() which is called by
unregister_pernet_subsys() during unregistering TIPC namespace operation,
the warning can be eliminated if calling unregister_pernet_subsys() is
moved before calling proto_unregister().

Fixes: e05b31f4bf ("tipc: make tipc socket support net namespace")
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31 23:10:08 -04:00
Jon Paul Maloy d482994fca tipc: fix two bugs in secondary destination lookup
A message sent to a node after a successful name table lookup may still
find that the destination socket has disappeared, because distribution
of name table updates is non-atomic. If so, the message will be rejected
back to the sender with error code TIPC_ERR_NO_PORT. If the source
socket of the message has disappeared in the meantime, the message
should be dropped.

However, in the currrent code, the message will instead be subject to an
unwanted tertiary lookup, because the function tipc_msg_lookup_dest()
doesn't check if there is an error code present in the message before
performing the lookup. In the worst case, the message may now find the
old destination again, and be redirected once more, instead of being
dropped directly as it should be.

A second bug in this function is that the "prev_node" field in the message
is not updated after successful lookup, something that may have
unpredictable consequences.

The problems arising from those bugs occur very infrequently.

The third change in this function; the test on msg_reroute_msg_cnt() is
purely cosmetic, reflecting that the returned value never can be negative.

This commit corrects the two bugs described above.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-29 13:47:36 -07:00
Ying Xue 8a0f6ebe84 tipc: involve reference counter for node structure
TIPC node hash node table is protected with rcu lock on read side.
tipc_node_find() is used to look for a node object with node address
through iterating the hash node table. As the entire process of what
tipc_node_find() traverses the table is guarded with rcu read lock,
it's safe for us. However, when callers use the node object returned
by tipc_node_find(), there is no rcu read lock applied. Therefore,
this is absolutely unsafe for callers of tipc_node_find().

Now we introduce a reference counter for node structure. Before
tipc_node_find() returns node object to its caller, it first increases
the reference counter. Accordingly, after its caller used it up,
it decreases the counter again. This can prevent a node being used by
one thread from being freed by another thread.

Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-29 12:40:28 -07:00
Ying Xue b952b2befb tipc: fix potential deadlock when all links are reset
[   60.988363] ======================================================
[   60.988754] [ INFO: possible circular locking dependency detected ]
[   60.989152] 3.19.0+ #194 Not tainted
[   60.989377] -------------------------------------------------------
[   60.989781] swapper/3/0 is trying to acquire lock:
[   60.990079]  (&(&n_ptr->lock)->rlock){+.-...}, at: [<ffffffffa0006dca>] tipc_link_retransmit+0x1aa/0x240 [tipc]
[   60.990743]
[   60.990743] but task is already holding lock:
[   60.991106]  (&(&bclink->lock)->rlock){+.-...}, at: [<ffffffffa00004be>] tipc_bclink_lock+0x8e/0xa0 [tipc]
[   60.991738]
[   60.991738] which lock already depends on the new lock.
[   60.991738]
[   60.992174]
[   60.992174] the existing dependency chain (in reverse order) is:
[   60.992174]
-> #1 (&(&bclink->lock)->rlock){+.-...}:
[   60.992174]        [<ffffffff810a9c0c>] lock_acquire+0x9c/0x140
[   60.992174]        [<ffffffff8179c41f>] _raw_spin_lock_bh+0x3f/0x50
[   60.992174]        [<ffffffffa00004be>] tipc_bclink_lock+0x8e/0xa0 [tipc]
[   60.992174]        [<ffffffffa0000f57>] tipc_bclink_add_node+0x97/0xf0 [tipc]
[   60.992174]        [<ffffffffa0011815>] tipc_node_link_up+0xf5/0x110 [tipc]
[   60.992174]        [<ffffffffa0007783>] link_state_event+0x2b3/0x4f0 [tipc]
[   60.992174]        [<ffffffffa00193c0>] tipc_link_proto_rcv+0x24c/0x418 [tipc]
[   60.992174]        [<ffffffffa0008857>] tipc_rcv+0x827/0xac0 [tipc]
[   60.992174]        [<ffffffffa0002ca3>] tipc_l2_rcv_msg+0x73/0xd0 [tipc]
[   60.992174]        [<ffffffff81646e66>] __netif_receive_skb_core+0x746/0x980
[   60.992174]        [<ffffffff816470c1>] __netif_receive_skb+0x21/0x70
[   60.992174]        [<ffffffff81647295>] netif_receive_skb_internal+0x35/0x130
[   60.992174]        [<ffffffff81648218>] napi_gro_receive+0x158/0x1d0
[   60.992174]        [<ffffffff81559e05>] e1000_clean_rx_irq+0x155/0x490
[   60.992174]        [<ffffffff8155c1b7>] e1000_clean+0x267/0x990
[   60.992174]        [<ffffffff81647b60>] net_rx_action+0x150/0x360
[   60.992174]        [<ffffffff8105ec43>] __do_softirq+0x123/0x360
[   60.992174]        [<ffffffff8105f12e>] irq_exit+0x8e/0xb0
[   60.992174]        [<ffffffff8179f9f5>] do_IRQ+0x65/0x110
[   60.992174]        [<ffffffff8179da6f>] ret_from_intr+0x0/0x13
[   60.992174]        [<ffffffff8100de9f>] arch_cpu_idle+0xf/0x20
[   60.992174]        [<ffffffff8109dfa6>] cpu_startup_entry+0x2f6/0x3f0
[   60.992174]        [<ffffffff81033cda>] start_secondary+0x13a/0x150
[   60.992174]
-> #0 (&(&n_ptr->lock)->rlock){+.-...}:
[   60.992174]        [<ffffffff810a8f7d>] __lock_acquire+0x163d/0x1ca0
[   60.992174]        [<ffffffff810a9c0c>] lock_acquire+0x9c/0x140
[   60.992174]        [<ffffffff8179c41f>] _raw_spin_lock_bh+0x3f/0x50
[   60.992174]        [<ffffffffa0006dca>] tipc_link_retransmit+0x1aa/0x240 [tipc]
[   60.992174]        [<ffffffffa0001e11>] tipc_bclink_rcv+0x611/0x640 [tipc]
[   60.992174]        [<ffffffffa0008646>] tipc_rcv+0x616/0xac0 [tipc]
[   60.992174]        [<ffffffffa0002ca3>] tipc_l2_rcv_msg+0x73/0xd0 [tipc]
[   60.992174]        [<ffffffff81646e66>] __netif_receive_skb_core+0x746/0x980
[   60.992174]        [<ffffffff816470c1>] __netif_receive_skb+0x21/0x70
[   60.992174]        [<ffffffff81647295>] netif_receive_skb_internal+0x35/0x130
[   60.992174]        [<ffffffff81648218>] napi_gro_receive+0x158/0x1d0
[   60.992174]        [<ffffffff81559e05>] e1000_clean_rx_irq+0x155/0x490
[   60.992174]        [<ffffffff8155c1b7>] e1000_clean+0x267/0x990
[   60.992174]        [<ffffffff81647b60>] net_rx_action+0x150/0x360
[   60.992174]        [<ffffffff8105ec43>] __do_softirq+0x123/0x360
[   60.992174]        [<ffffffff8105f12e>] irq_exit+0x8e/0xb0
[   60.992174]        [<ffffffff8179f9f5>] do_IRQ+0x65/0x110
[   60.992174]        [<ffffffff8179da6f>] ret_from_intr+0x0/0x13
[   60.992174]        [<ffffffff8100de9f>] arch_cpu_idle+0xf/0x20
[   60.992174]        [<ffffffff8109dfa6>] cpu_startup_entry+0x2f6/0x3f0
[   60.992174]        [<ffffffff81033cda>] start_secondary+0x13a/0x150
[   60.992174]
[   60.992174] other info that might help us debug this:
[   60.992174]
[   60.992174]  Possible unsafe locking scenario:
[   60.992174]
[   60.992174]        CPU0                    CPU1
[   60.992174]        ----                    ----
[   60.992174]   lock(&(&bclink->lock)->rlock);
[   60.992174]                                lock(&(&n_ptr->lock)->rlock);
[   60.992174]                                lock(&(&bclink->lock)->rlock);
[   60.992174]   lock(&(&n_ptr->lock)->rlock);
[   60.992174]
[   60.992174]  *** DEADLOCK ***
[   60.992174]
[   60.992174] 3 locks held by swapper/3/0:
[   60.992174]  #0:  (rcu_read_lock){......}, at: [<ffffffff81646791>] __netif_receive_skb_core+0x71/0x980
[   60.992174]  #1:  (rcu_read_lock){......}, at: [<ffffffffa0002c35>] tipc_l2_rcv_msg+0x5/0xd0 [tipc]
[   60.992174]  #2:  (&(&bclink->lock)->rlock){+.-...}, at: [<ffffffffa00004be>] tipc_bclink_lock+0x8e/0xa0 [tipc]
[   60.992174]

The correct the sequence of grabbing n_ptr->lock and bclink->lock
should be that the former is first held and the latter is then taken,
which exactly happened on CPU1. But especially when the retransmission
of broadcast link is failed, bclink->lock is first held in
tipc_bclink_rcv(), and n_ptr->lock is taken in link_retransmit_failure()
called by tipc_link_retransmit() subsequently, which is demonstrated on
CPU0. As a result, deadlock occurs.

If the order of holding the two locks happening on CPU0 is reversed, the
deadlock risk will be relieved. Therefore, the node lock taken in
link_retransmit_failure() originally is moved to tipc_bclink_rcv()
so that it's obtained before bclink lock. But the precondition of
the adjustment of node lock is that responding to bclink reset event
must be moved from tipc_bclink_unlock() to tipc_node_unlock().

Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-29 12:40:27 -07:00
Jon Paul Maloy 8b4ed8634f tipc: eliminate race condition at dual link establishment
Despite recent improvements, the establishment of dual parallel
links still has a small glitch where messages can bypass each
other. When the second link in a dual-link configuration is
established, part of the first link's traffic will be steered over
to the new link. Although we do have a mechanism to ensure that
packets sent before and after the establishment of the new link
arrive in sequence to the destination node, this is not enough.
The arriving messages will still be delivered upwards in different
threads, something entailing a risk of message disordering during
the transition phase.

To fix this, we introduce a synchronization mechanism between the
two parallel links, so that traffic arriving on the new link cannot
be added to its input queue until we are guaranteed that all
pre-establishment messages have been delivered on the old, parallel
link.

This problem seems to always have been around, but its occurrence is
so rare that it has not been noticed until recent intensive testing.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-25 14:05:56 -04:00
Jon Paul Maloy 3127a0200d tipc: clean up handling of link congestion
After the recent changes in message importance handling it becomes
possible to simplify handling of messages and sockets when we
encounter link congestion.

We merge the function tipc_link_cong() into link_schedule_user(),
and simplify the code of the latter. The code should now be
easier to follow, especially regarding return codes and handling
of the message that caused the situation.

In case the scheduling function is unable to pre-allocate a wakeup
message buffer, it now returns -ENOBUFS, which is a more correct
code than the previously used -EHOSTUNREACH.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-25 14:05:56 -04:00
Jon Paul Maloy 1f66d161ab tipc: introduce starvation free send algorithm
Currently, we only use a single counter; the length of the backlog
queue, to determine whether a message should be accepted to the queue
or not. Each time a message is being sent, the queue length is compared
to a threshold value for the message's importance priority. If the queue
length is beyond this threshold, the message is rejected. This algorithm
implies a risk of starvation of low importance senders during very high
load, because it may take a long time before the backlog queue has
decreased enough to accept a lower level message.

We now eliminate this risk by introducing a counter for each importance
priority. When a message is sent, we check only the queue level for that
particular message's priority. If that is ok, the message can be added
to the backlog, irrespective of the queue level for other priorities.
This way, each level is guaranteed a certain portion of the total
bandwidth, and any risk of starvation is eliminated.

Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-25 14:05:56 -04:00
Ying Xue bc14b8d6a9 tipc: fix a link reset issue due to retransmission failures
When a node joins a cluster while we are transmitting a fragment
stream over the broadcast link, it's missing the preceding fragments
needed to build a meaningful message. As a result, the node has to
drop it. However, as the fragment message is not acknowledged to
its sender before it's dropped, it accidentally causes link reset
of retransmission failure on the node.

Reported-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Tested-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-25 11:43:32 -04:00
Thomas Graf b5e2c150ac rhashtable: Disable automatic shrinking by default
Introduce a new bool automatic_shrinking to require the
user to explicitly opt-in to automatic shrinking of tables.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-24 17:48:40 -04:00
Ying Xue ed3e852aa5 tipc: fix compile error when IPV6=m and TIPC=y
When IPV6=m and TIPC=y, below error will appear during building kernel
image:

net/tipc/udp_media.c:196:
undefined reference to `ip6_dst_lookup'
make: *** [vmlinux] Error 1

As ip6_dst_lookup() is implemented in IPV6 and IPV6 is compiled as
module, ip6_dst_lookup() is not built-in core kernel image. As a
result, compiler cannot find 'ip6_dst_lookup' reference while
compiling TIPC code into core kernel image.

But with the method introduced by commit 5f81bd2e5d ("ipv6: export a
stub for IPv6 symbols used by vxlan"), we can avoid the compile error
through "ipv6_stub" pointer to access ip6_dst_lookup().

Fixes: d0f91938be ("tipc: add ip/udp media type")
Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-24 15:20:29 -04:00
Sasha Levin 610600c8c5 tipc: validate length of sockaddr in connect() for dgram/rdm
Commit f2f8036 ("tipc: add support for connect() on dgram/rdm sockets")
hasn't validated user input length for the sockaddr structure which allows
a user to overwrite kernel memory with arbitrary input.

Fixes: f2f8036 ("tipc: add support for connect() on dgram/rdm sockets")
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-24 12:50:39 -04:00
Herbert Xu 6d02294981 tipc: Use default rhashtable hashfn
This patch removes the explicit jhash value for the hashfn parameter
of rhashtable.  The default is now jhash so removing the setting
makes no difference apart from making one less copy of jhash in
the kernel.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 22:07:51 -04:00
Herbert Xu 6cca7289d5 tipc: Use inlined rhashtable interface
This patch converts tipc to the inlined rhashtable interface.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 16:16:24 -04:00
Marcelo Ricardo Leitner 446981e5fc tipc: fix build issue when building without IPv6
We can't directly call ipv6_sock_mc_join() but should use the stub
instead and protect it around IS_ENABLED.

Fixes: d0f91938be ("tipc: add ip/udp media type")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-19 16:06:27 -04:00
Erik Hugne f2f8036e39 tipc: add support for connect() on dgram/rdm sockets
Following the example of ip4_datagram_connect, we store the
address in the socket structure for dgram/rdm sockets and use
that as the default destination for subsequent send() calls.
It is allowed to connect to any address types, and the behaviour
of send() will be the same as a normal sendto() with this address
provided. Binding to an AF_UNSPEC address clears the association.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-19 12:25:54 -04:00
Erik Hugne 3bd88ee7a2 tipc: do not report -EHOSTUNREACH for failed local delivery
Since commit 1186adf7df ("tipc: simplify message forwarding and
rejection in socket layer") -EHOSTUNREACH is propagated back to
the sending process if we fail to deliver the message to another
socket local to the node.
This is wrong, host unreachable should only be reported when the
destination port/name does not exist in the cluster, and that
check is always done before sending the message. Also, this
introduces inconsistent sendmsg() behavior for local/remote
destinations. Errors occurring on the receiving side should not
trickle up to the sender. If message delivery fails TIPC should
either discard the packet or reject it back to the sender based
on the destination droppable option.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-19 12:25:54 -04:00
Erik Hugne 18d6c58415 tipc: remove redundant call to tipc_node_remove_conn
tipc_node_remove_conn may be called twice if shutdown() is
called on a socket that have messages in the receive queue.
Calling this function twice does no harm, but is unnecessary
and we remove the redundant call.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-19 12:25:54 -04:00
Marcelo Ricardo Leitner 54ff9ef36b ipv4, ipv6: kill ip_mc_{join, leave}_group and ipv6_sock_mc_{join, drop}
in favor of their inner __ ones, which doesn't grab rtnl.

As these functions need to operate on a locked socket, we can't be
grabbing rtnl by then. It's too late and doing so causes reversed
locking.

So this patch:
- move rtnl handling to callers instead while already fixing some
  reversed locking situations, like on vxlan and ipvs code.
- renames __ ones to not have the __ mark:
  __ip_mc_{join,leave}_group -> ip_mc_{join,leave}_group
  __ipv6_sock_mc_{join,drop} -> ipv6_sock_mc_{join,drop}

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:05:09 -04:00
Herbert Xu 446c89ac1f tipc: Use rhashtable max/min_size instead of max/min_shift
This patch converts tipc to use rhashtable max/min_size instead of
the obsolete max/min_shift.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 12:46:40 -04:00
Ying Xue 2b9bb7f338 tipc: withdraw tipc topology server name when namespace is deleted
The TIPC topology server is a per namespace service associated with the
tipc name {1, 1}. When a namespace is deleted, that name must be withdrawn
before we call sk_release_kernel because the kernel socket release is
done in init_net and trying to withdraw a TIPC name published in another
namespace will fail with an error as:

[  170.093264] Unable to remove local publication
[  170.093264] (type=1, lower=1, ref=2184244004, key=2184244005)

We fix this by breaking the association between the topology server name
and socket before calling sk_release_kernel.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-17 22:11:26 -04:00
Ying Xue 8460504bdd tipc: fix a potential deadlock when nametable is purged
[   28.531768] =============================================
[   28.532322] [ INFO: possible recursive locking detected ]
[   28.532322] 3.19.0+ #194 Not tainted
[   28.532322] ---------------------------------------------
[   28.532322] insmod/583 is trying to acquire lock:
[   28.532322]  (&(&nseq->lock)->rlock){+.....}, at: [<ffffffffa000d219>] tipc_nametbl_remove_publ+0x49/0x2e0 [tipc]
[   28.532322]
[   28.532322] but task is already holding lock:
[   28.532322]  (&(&nseq->lock)->rlock){+.....}, at: [<ffffffffa000e0dc>] tipc_nametbl_stop+0xfc/0x1f0 [tipc]
[   28.532322]
[   28.532322] other info that might help us debug this:
[   28.532322]  Possible unsafe locking scenario:
[   28.532322]
[   28.532322]        CPU0
[   28.532322]        ----
[   28.532322]   lock(&(&nseq->lock)->rlock);
[   28.532322]   lock(&(&nseq->lock)->rlock);
[   28.532322]
[   28.532322]  *** DEADLOCK ***
[   28.532322]
[   28.532322]  May be due to missing lock nesting notation
[   28.532322]
[   28.532322] 3 locks held by insmod/583:
[   28.532322]  #0:  (net_mutex){+.+.+.}, at: [<ffffffff8163e30f>] register_pernet_subsys+0x1f/0x50
[   28.532322]  #1:  (&(&tn->nametbl_lock)->rlock){+.....}, at: [<ffffffffa000e091>] tipc_nametbl_stop+0xb1/0x1f0 [tipc]
[   28.532322]  #2:  (&(&nseq->lock)->rlock){+.....}, at: [<ffffffffa000e0dc>] tipc_nametbl_stop+0xfc/0x1f0 [tipc]
[   28.532322]
[   28.532322] stack backtrace:
[   28.532322] CPU: 1 PID: 583 Comm: insmod Not tainted 3.19.0+ #194
[   28.532322] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[   28.532322]  ffffffff82394460 ffff8800144cb928 ffffffff81792f3e 0000000000000007
[   28.532322]  ffffffff82394460 ffff8800144cba28 ffffffff810a8080 ffff8800144cb998
[   28.532322]  ffffffff810a4df3 ffff880013e9cb10 ffffffff82b0d330 ffff880013e9cb38
[   28.532322] Call Trace:
[   28.532322]  [<ffffffff81792f3e>] dump_stack+0x4c/0x65
[   28.532322]  [<ffffffff810a8080>] __lock_acquire+0x740/0x1ca0
[   28.532322]  [<ffffffff810a4df3>] ? __bfs+0x23/0x270
[   28.532322]  [<ffffffff810a7506>] ? check_irq_usage+0x96/0xe0
[   28.532322]  [<ffffffff810a8a73>] ? __lock_acquire+0x1133/0x1ca0
[   28.532322]  [<ffffffffa000d219>] ? tipc_nametbl_remove_publ+0x49/0x2e0 [tipc]
[   28.532322]  [<ffffffff810a9c0c>] lock_acquire+0x9c/0x140
[   28.532322]  [<ffffffffa000d219>] ? tipc_nametbl_remove_publ+0x49/0x2e0 [tipc]
[   28.532322]  [<ffffffff8179c41f>] _raw_spin_lock_bh+0x3f/0x50
[   28.532322]  [<ffffffffa000d219>] ? tipc_nametbl_remove_publ+0x49/0x2e0 [tipc]
[   28.532322]  [<ffffffffa000d219>] tipc_nametbl_remove_publ+0x49/0x2e0 [tipc]
[   28.532322]  [<ffffffffa000e11e>] tipc_nametbl_stop+0x13e/0x1f0 [tipc]
[   28.532322]  [<ffffffffa000dfe5>] ? tipc_nametbl_stop+0x5/0x1f0 [tipc]
[   28.532322]  [<ffffffffa0004bab>] tipc_init_net+0x13b/0x150 [tipc]
[   28.532322]  [<ffffffffa0004a75>] ? tipc_init_net+0x5/0x150 [tipc]
[   28.532322]  [<ffffffff8163dece>] ops_init+0x4e/0x150
[   28.532322]  [<ffffffff810aa66d>] ? trace_hardirqs_on+0xd/0x10
[   28.532322]  [<ffffffff8163e1d3>] register_pernet_operations+0xf3/0x190
[   28.532322]  [<ffffffff8163e31e>] register_pernet_subsys+0x2e/0x50
[   28.532322]  [<ffffffffa002406a>] tipc_init+0x6a/0x1000 [tipc]
[   28.532322]  [<ffffffffa0024000>] ? 0xffffffffa0024000
[   28.532322]  [<ffffffff810002d9>] do_one_initcall+0x89/0x1c0
[   28.532322]  [<ffffffff811b7cb0>] ? kmem_cache_alloc_trace+0x50/0x1b0
[   28.532322]  [<ffffffff810e725b>] ? do_init_module+0x2b/0x200
[   28.532322]  [<ffffffff810e7294>] do_init_module+0x64/0x200
[   28.532322]  [<ffffffff810e9353>] load_module+0x12f3/0x18e0
[   28.532322]  [<ffffffff810e5890>] ? show_initstate+0x50/0x50
[   28.532322]  [<ffffffff810e9a19>] SyS_init_module+0xd9/0x110
[   28.532322]  [<ffffffff8179f3b3>] sysenter_dispatch+0x7/0x1f

Before tipc_purge_publications() calls tipc_nametbl_remove_publ() to
remove a publication with a name sequence, the name sequence's lock
is held. However, when tipc_nametbl_remove_publ() calling
tipc_nameseq_remove_publ() to remove the publication, it first tries
to query name sequence instance with the publication, and then holds
the lock of the found name sequence. But as the lock may be already
taken in tipc_purge_publications(), deadlock happens like above
scenario demonstrated. As tipc_nameseq_remove_publ() doesn't grab name
sequence's lock, the deadlock can be avoided if it's directly invoked
by tipc_purge_publications().

Fixes: 97ede29e80 ("tipc: convert name table read-write lock to RCU")
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-17 22:11:26 -04:00
Ying Xue 76100a8a64 tipc: fix netns refcnt leak
When the TIPC module is loaded, we launch a topology server in kernel
space, which in its turn is creating TIPC sockets for communication
with topology server users. Because both the socket's creator and
provider reside in the same module, it is necessary that the TIPC
module's reference count remains zero after the server is started and
the socket created; otherwise it becomes impossible to perform "rmmod"
even on an idle module.

Currently, we achieve this by defining a separate "tipc_proto_kern"
protocol struct, that is used only for kernel space socket allocations.
This structure has the "owner" field set to NULL, which restricts the
module reference count from being be bumped when sk_alloc() for local
sockets is called. Furthermore, we have defined three kernel-specific
functions, tipc_sock_create_local(), tipc_sock_release_local() and
tipc_sock_accept_local(), to avoid the module counter being modified
when module local sockets are created or deleted. This has worked well
until we introduced name space support.

However, after name space support was introduced, we have observed that
a reference count leak occurs, because the netns counter is not
decremented in tipc_sock_delete_local().

This commit remedies this problem. But instead of just modifying
tipc_sock_delete_local(), we eliminate the whole parallel socket
handling infrastructure, and start using the regular sk_create_kern(),
kernel_accept() and sk_release_kernel() calls. Since those functions
manipulate the module counter, we must now compensate for that by
explicitly decrementing the counter after module local sockets are
created, and increment it just before calling sk_release_kernel().

Fixes: a62fbccecd ("tipc: make subscriber server support net namespace")
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Reviewed-by: Jon Maloy <jon.maloy@ericson.com>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reported-by: Cong Wang <cwang@twopensource.com>
Tested-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-17 22:11:26 -04:00
Jon Paul Maloy e3eea1eb47 tipc: clean up handling of message priorities
Messages transferred by TIPC are assigned an "importance priority", -an
integer value indicating how to treat the message when there is link or
destination socket congestion.

There is no separate header field for this value. Instead, the message
user values have been chosen in ascending order according to perceived
importance, so that the message user field can be used for this.

This is not a good solution. First, we have many more users than the
needed priority levels, so we end up with treating more priority
levels than necessary. Second, the user field cannot always
accurately reflect the priority of the message. E.g., a message
fragment packet should really have the priority of the enveloped
user data message, and not the priority of the MSG_FRAGMENTER user.
Until now, we have been working around this problem in different ways,
but it is now time to implement a consistent way of handling such
priorities, although still within the constraint that we cannot
allocate any more bits in the regular data message header for this.

In this commit, we define a new priority level, TIPC_SYSTEM_IMPORTANCE,
that will be the only one used apart from the four (lower) user data
levels. All non-data messages map down to this priority. Furthermore,
we take some free bits from the MSG_FRAGMENTER header and allocate
them to store the priority of the enveloped message. We then adjust
the functions msg_importance()/msg_set_importance() so that they
read/set the correct header fields depending on user type.

This small protocol change is fully compatible, because the code at
the receiving end of a link currently reads the importance level
only from user data messages, where there is no change.

Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-14 14:38:32 -04:00
Jon Paul Maloy 05dcc5aa4d tipc: split link outqueue
struct tipc_link contains one single queue for outgoing packets,
where both transmitted and waiting packets are queued.

This infrastructure is hard to maintain, because we need
to keep a number of fields to keep track of which packets are
sent or unsent, and the number of packets in each category.

A lot of code becomes simpler if we split this queue into a transmission
queue, where sent/unacknowledged packets are kept, and a backlog queue,
where we keep the not yet sent packets.

In this commit we do this separation.

Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-14 14:38:32 -04:00
Jon Paul Maloy 2cdf3918e4 tipc: eliminate unnecessary call to broadcast ack function
The unicast packet header contains a broadcast acknowledge sequence
number, that may need to be conveyed to the broadcast link for proper
treatment. Currently, the function tipc_rcv(), which is on the most
critical data path, calls the function tipc_bclink_acknowledge() to
have this done. This call is made for each received packet, and results
in the unconditional grabbing of the broadcast link spinlock.

This is unnecessary, since we can see directly from tipc_rcv() if
the acknowledged number differs from what has been previously acked
from the node in question. In the vast majority of cases the numbers
won't differ, and there is nothing to update.

We now make the call to tipc_bclink_acknowledge() conditional
to that the two ack values differ.

Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-14 14:38:32 -04:00
Jon Paul Maloy c1336ee472 tipc: extract bundled buffers by cloning instead of copying
When we currently extract a bundled buffer from a message bundle in
the function tipc_msg_extract(), we allocate a new buffer and explicitly
copy the linear data area.

This is unnecessary, since we can just clone the buffer and do
skb_pull() on the clone to move the data pointer to the correct
position.

This is what we do in this commit.

Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-14 14:38:32 -04:00
Jon Paul Maloy 1149557d64 tipc: eliminate unnecessary linearization of incoming buffers
Currently, TIPC linearizes all incoming buffers directly at reception
before passing them upwards in the stack. This is clearly a waste of
CPU resources, and must be avoided.

In this commit, we eliminate this unnecessary linearization. We still
ensure that at least the message header is linear, and that the buffer
is linearized where this is still needed, i.e. when unbundling and when
reversing messages.

In addition, we ensure that fragmented messages are validated after
reassembly before delivering them upwards in the stack.

Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-14 14:38:32 -04:00
Jon Paul Maloy cf2157f88a tipc: move message validation function to msg.c
The function link_buf_validate() is in reality re-entrant and context
independent, and will in later commits be called from several locations.
Therefore, we move it to msg.c, make it outline and rename the it to
tipc_msg_validate().

We also redesign the function to make proper use of pskb_may_pull()

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-14 14:38:32 -04:00
Jon Paul Maloy 7764d6e83d tipc: add framework for node capabilities exchange
The TIPC protocol spec has defined a 13 bit capability bitmap in
the neighbor discovery header, as a means to maintain compatibility
between different code and protocol generations. Until now this field
has been unused.

We now introduce the basic framework for exchanging capabilities
between nodes at first contact. After exchange, a peer node's
capabilities are stored as a 16 bit bitmap in struct tipc_node.

Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-14 14:38:32 -04:00
Jon Paul Maloy 169bf9121b tipc: ensure that idle links are deleted when a bearer is disabled
commit afaa3f65f6
(tipc: purge links when bearer is disabled) was an attempt to resolve
a problem that turned out to have a more profound reason.

When we disable a bearer, we delete all its pertaining links if
there is no other bearer to perform failover to, or if the module
is shutting down. In case there are dual bearers, we wait with
deleting links until the failover procedure is finished.

However, this misses the case when a link on the removed bearer
was already down, so that there will be no failover procedure to
finish the link delete. This causes confusion if a new bearer is
added to replace the removed one, and also entails a small memory
leak.

This commit takes the current state of the link into account when
deciding when to delete it, and also reverses the above-mentioned
commit.

Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-10 18:37:36 -04:00