2006-01-03 02:04:38 +08:00
|
|
|
/*
|
|
|
|
* net/tipc/node.h: Include file for TIPC node management routines
|
2007-02-09 22:25:21 +08:00
|
|
|
*
|
2016-09-02 01:52:49 +08:00
|
|
|
* Copyright (c) 2000-2006, 2014-2016, Ericsson AB
|
2014-03-27 12:54:36 +08:00
|
|
|
* Copyright (c) 2005, 2010-2014, Wind River Systems
|
2006-01-03 02:04:38 +08:00
|
|
|
* All rights reserved.
|
|
|
|
*
|
2006-01-11 20:30:43 +08:00
|
|
|
* Redistribution and use in source and binary forms, with or without
|
2006-01-03 02:04:38 +08:00
|
|
|
* modification, are permitted provided that the following conditions are met:
|
|
|
|
*
|
2006-01-11 20:30:43 +08:00
|
|
|
* 1. Redistributions of source code must retain the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer.
|
|
|
|
* 2. Redistributions in binary form must reproduce the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer in the
|
|
|
|
* documentation and/or other materials provided with the distribution.
|
|
|
|
* 3. Neither the names of the copyright holders nor the names of its
|
|
|
|
* contributors may be used to endorse or promote products derived from
|
|
|
|
* this software without specific prior written permission.
|
2006-01-03 02:04:38 +08:00
|
|
|
*
|
2006-01-11 20:30:43 +08:00
|
|
|
* Alternatively, this software may be distributed under the terms of the
|
|
|
|
* GNU General Public License ("GPL") version 2 as published by the Free
|
|
|
|
* Software Foundation.
|
|
|
|
*
|
|
|
|
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
|
|
|
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
|
|
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
|
|
* ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
|
|
|
|
* LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
|
|
|
|
* CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
|
|
|
|
* SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
|
|
|
|
* INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
|
|
|
|
* CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
|
|
|
|
* ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
|
2006-01-03 02:04:38 +08:00
|
|
|
* POSSIBILITY OF SUCH DAMAGE.
|
|
|
|
*/
|
|
|
|
|
|
|
|
#ifndef _TIPC_NODE_H
|
|
|
|
#define _TIPC_NODE_H
|
|
|
|
|
2011-01-01 02:59:19 +08:00
|
|
|
#include "addr.h"
|
|
|
|
#include "net.h"
|
2006-01-03 02:04:38 +08:00
|
|
|
#include "bearer.h"
|
2014-06-26 09:41:33 +08:00
|
|
|
#include "msg.h"
|
2006-01-03 02:04:38 +08:00
|
|
|
|
2015-10-22 20:51:40 +08:00
|
|
|
/* Optional capabilities supported by this code version
|
|
|
|
*/
|
|
|
|
enum {
|
2018-09-29 02:23:21 +08:00
|
|
|
TIPC_SYN_BIT = (1),
|
2016-09-02 01:52:49 +08:00
|
|
|
TIPC_BCAST_SYNCH = (1 << 1),
|
|
|
|
TIPC_BCAST_STATE_NACK = (1 << 2),
|
2017-01-19 02:50:53 +08:00
|
|
|
TIPC_BLOCK_FLOWCTL = (1 << 3),
|
tipc: introduce communication groups
As a preparation for introducing flow control for multicast and datagram
messaging we need a more strictly defined framework than we have now. A
socket must be able keep track of exactly how many and which other
sockets it is allowed to communicate with at any moment, and keep the
necessary state for those.
We therefore introduce a new concept we have named Communication Group.
Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN.
The call takes four parameters: 'type' serves as group identifier,
'instance' serves as an logical member identifier, and 'scope' indicates
the visibility of the group (node/cluster/zone). Finally, 'flags' makes
it possible to set certain properties for the member. For now, there is
only one flag, indicating if the creator of the socket wants to receive
a copy of broadcast or multicast messages it is sending via the socket,
and if wants to be eligible as destination for its own anycasts.
A group is closed, i.e., sockets which have not joined a group will
not be able to send messages to or receive messages from members of
the group, and vice versa.
Any member of a group can send multicast ('group broadcast') messages
to all group members, optionally including itself, using the primitive
send(). The messages are received via the recvmsg() primitive. A socket
can only be member of one group at a time.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-13 17:04:23 +08:00
|
|
|
TIPC_BCAST_RCAST = (1 << 4),
|
2018-07-10 07:07:35 +08:00
|
|
|
TIPC_NODE_ID128 = (1 << 5),
|
2019-03-19 19:49:49 +08:00
|
|
|
TIPC_LINK_PROTO_SEQNO = (1 << 6),
|
tipc: improve TIPC throughput by Gap ACK blocks
During unicast link transmission, it's observed very often that because
of one or a few lost/dis-ordered packets, the sending side will fastly
reach the send window limit and must wait for the packets to be arrived
at the receiving side or in the worst case, a retransmission must be
done first. The sending side cannot release a lot of subsequent packets
in its transmq even though all of them might have already been received
by the receiving side.
That is, one or two packets dis-ordered/lost and dozens of packets have
to wait, this obviously reduces the overall throughput!
This commit introduces an algorithm to overcome this by using "Gap ACK
blocks". Basically, a Gap ACK block will consist of <ack, gap> numbers
that describes the link deferdq where packets have been got by the
receiving side but with gaps, for example:
link deferdq: [1 2 3 4 10 11 13 14 15 20]
--> Gap ACK blocks: <4, 5>, <11, 1>, <15, 4>, <20, 0>
The Gap ACK blocks will be sent to the sending side along with the
traditional ACK or NACK message. Immediately when receiving the message
the sending side will now not only release from its transmq the packets
ack-ed by the ACK but also by the Gap ACK blocks! So, more packets can
be enqueued and transmitted.
In addition, the sending side can now do "multi-retransmissions"
according to the Gaps reported in the Gap ACK blocks.
The new algorithm as verified helps greatly improve the TIPC throughput
especially under packet loss condition.
So far, a maximum of 32 blocks is quite enough without any "Too few Gap
ACK blocks" reports with a 5.0% packet loss rate, however this number
can be increased in the furture if needed.
Also, the patch is backward compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 12:09:51 +08:00
|
|
|
TIPC_MCAST_RBCTL = (1 << 7),
|
2019-07-24 09:56:11 +08:00
|
|
|
TIPC_GAP_ACK_BLOCK = (1 << 8),
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 21:00:41 +08:00
|
|
|
TIPC_TUNNEL_ENHANCED = (1 << 9),
|
|
|
|
TIPC_NAGLE = (1 << 10)
|
2015-10-22 20:51:40 +08:00
|
|
|
};
|
|
|
|
|
2018-09-29 02:23:21 +08:00
|
|
|
#define TIPC_NODE_CAPABILITIES (TIPC_SYN_BIT | \
|
|
|
|
TIPC_BCAST_SYNCH | \
|
|
|
|
TIPC_BCAST_STATE_NACK | \
|
|
|
|
TIPC_BCAST_RCAST | \
|
|
|
|
TIPC_BLOCK_FLOWCTL | \
|
|
|
|
TIPC_NODE_ID128 | \
|
2019-03-19 19:49:49 +08:00
|
|
|
TIPC_LINK_PROTO_SEQNO | \
|
tipc: improve TIPC throughput by Gap ACK blocks
During unicast link transmission, it's observed very often that because
of one or a few lost/dis-ordered packets, the sending side will fastly
reach the send window limit and must wait for the packets to be arrived
at the receiving side or in the worst case, a retransmission must be
done first. The sending side cannot release a lot of subsequent packets
in its transmq even though all of them might have already been received
by the receiving side.
That is, one or two packets dis-ordered/lost and dozens of packets have
to wait, this obviously reduces the overall throughput!
This commit introduces an algorithm to overcome this by using "Gap ACK
blocks". Basically, a Gap ACK block will consist of <ack, gap> numbers
that describes the link deferdq where packets have been got by the
receiving side but with gaps, for example:
link deferdq: [1 2 3 4 10 11 13 14 15 20]
--> Gap ACK blocks: <4, 5>, <11, 1>, <15, 4>, <20, 0>
The Gap ACK blocks will be sent to the sending side along with the
traditional ACK or NACK message. Immediately when receiving the message
the sending side will now not only release from its transmq the packets
ack-ed by the ACK but also by the Gap ACK blocks! So, more packets can
be enqueued and transmitted.
In addition, the sending side can now do "multi-retransmissions"
according to the Gaps reported in the Gap ACK blocks.
The new algorithm as verified helps greatly improve the TIPC throughput
especially under packet loss condition.
So far, a maximum of 32 blocks is quite enough without any "Too few Gap
ACK blocks" reports with a 5.0% packet loss rate, however this number
can be increased in the furture if needed.
Also, the patch is backward compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 12:09:51 +08:00
|
|
|
TIPC_MCAST_RBCTL | \
|
2019-07-24 09:56:11 +08:00
|
|
|
TIPC_GAP_ACK_BLOCK | \
|
tipc: add smart nagle feature
We introduce a feature that works like a combination of TCP_NAGLE and
TCP_CORK, but without some of the weaknesses of those. In particular,
we will not observe long delivery delays because of delayed acks, since
the algorithm itself decides if and when acks are to be sent from the
receiving peer.
- The nagle property as such is determined by manipulating a new
'maxnagle' field in struct tipc_sock. If certain conditions are met,
'maxnagle' will define max size of the messages which can be bundled.
If it is set to zero no messages are ever bundled, implying that the
nagle property is disabled.
- A socket with the nagle property enabled enters nagle mode when more
than 4 messages have been sent out without receiving any data message
from the peer.
- A socket leaves nagle mode whenever it receives a data message from
the peer.
In nagle mode, messages smaller than 'maxnagle' are accumulated in the
socket write queue. The last buffer in the queue is marked with a new
'ack_required' bit, which forces the receiving peer to send a CONN_ACK
message back to the sender upon reception.
The accumulated contents of the write queue is transmitted when one of
the following events or conditions occur.
- A CONN_ACK message is received from the peer.
- A data message is received from the peer.
- A SOCK_WAKEUP pseudo message is received from the link level.
- The write queue contains more than 64 1k blocks of data.
- The connection is being shut down.
- There is no CONN_ACK message to expect. I.e., there is currently
no outstanding message where the 'ack_required' bit was set. As a
consequence, the first message added after we enter nagle mode
is always sent directly with this bit set.
This new feature gives a 50-100% improvement of throughput for small
(i.e., less than MTU size) messages, while it might add up to one RTT
to latency time when the socket is in nagle mode.
Acked-by: Ying Xue <ying.xue@windreiver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 21:00:41 +08:00
|
|
|
TIPC_TUNNEL_ENHANCED | \
|
|
|
|
TIPC_NAGLE)
|
|
|
|
|
2015-11-20 03:30:46 +08:00
|
|
|
#define INVALID_BEARER_ID -1
|
2015-10-22 20:51:40 +08:00
|
|
|
|
2015-01-09 15:27:05 +08:00
|
|
|
void tipc_node_stop(struct net *net);
|
2018-04-26 01:29:36 +08:00
|
|
|
bool tipc_node_get_id(struct net *net, u32 addr, u8 *id);
|
tipc: enable tracepoints in tipc
As for the sake of debugging/tracing, the commit enables tracepoints in
TIPC along with some general trace_events as shown below. It also
defines some 'tipc_*_dump()' functions that allow to dump TIPC object
data whenever needed, that is, for general debug purposes, ie. not just
for the trace_events.
The following trace_events are now available:
- trace_tipc_skb_dump(): allows to trace and dump TIPC msg & skb data,
e.g. message type, user, droppable, skb truesize, cloned skb, etc.
- trace_tipc_list_dump(): allows to trace and dump any TIPC buffers or
queues, e.g. TIPC link transmq, socket receive queue, etc.
- trace_tipc_sk_dump(): allows to trace and dump TIPC socket data, e.g.
sk state, sk type, connection type, rmem_alloc, socket queues, etc.
- trace_tipc_link_dump(): allows to trace and dump TIPC link data, e.g.
link state, silent_intv_cnt, gap, bc_gap, link queues, etc.
- trace_tipc_node_dump(): allows to trace and dump TIPC node data, e.g.
node state, active links, capabilities, link entries, etc.
How to use:
Put the trace functions at any places where we want to dump TIPC data
or events.
Note:
a) The dump functions will generate raw data only, that is, to offload
the trace event's processing, it can require a tool or script to parse
the data but this should be simple.
b) The trace_tipc_*_dump() should be reserved for a failure cases only
(e.g. the retransmission failure case) or where we do not expect to
happen too often, then we can consider enabling these events by default
since they will almost not take any effects under normal conditions,
but once the rare condition or failure occurs, we get the dumped data
fully for post-analysis.
For other trace purposes, we can reuse these trace classes as template
but different events.
c) A trace_event is only effective when we enable it. To enable the
TIPC trace_events, echo 1 to 'enable' files in the events/tipc/
directory in the 'debugfs' file system. Normally, they are located at:
/sys/kernel/debug/tracing/events/tipc/
For example:
To enable the tipc_link_dump event:
echo 1 > /sys/kernel/debug/tracing/events/tipc/tipc_link_dump/enable
To enable all the TIPC trace_events:
echo 1 > /sys/kernel/debug/tracing/events/tipc/enable
To collect the trace data:
cat trace
or
cat trace_pipe > /trace.out &
To disable all the TIPC trace_events:
echo 0 > /sys/kernel/debug/tracing/events/tipc/enable
To clear the trace buffer:
echo > trace
d) Like the other trace_events, the feature like 'filter' or 'trigger'
is also usable for the tipc trace_events.
For more details, have a look at:
Documentation/trace/ftrace.txt
MAINTAINERS | add two new files 'trace.h' & 'trace.c' in tipc
Acked-by: Ying Xue <ying.xue@windriver.com>
Tested-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 10:17:56 +08:00
|
|
|
u32 tipc_node_get_addr(struct tipc_node *node);
|
tipc: handle collisions of 32-bit node address hash values
When a 32-bit node address is generated from a 128-bit identifier,
there is a risk of collisions which must be discovered and handled.
We do this as follows:
- We don't apply the generated address immediately to the node, but do
instead initiate a 1 sec trial period to allow other cluster members
to discover and handle such collisions.
- During the trial period the node periodically sends out a new type
of message, DSC_TRIAL_MSG, using broadcast or emulated broadcast,
to all the other nodes in the cluster.
- When a node is receiving such a message, it must check that the
presented 32-bit identifier either is unused, or was used by the very
same peer in a previous session. In both cases it accepts the request
by not responding to it.
- If it finds that the same node has been up before using a different
address, it responds with a DSC_TRIAL_FAIL_MSG containing that
address.
- If it finds that the address has already been taken by some other
node, it generates a new, unused address and returns it to the
requester.
- During the trial period the requesting node must always be prepared
to accept a failure message, i.e., a message where a peer suggests a
different (or equal) address to the one tried. In those cases it
must apply the suggested value as trial address and restart the trial
period.
This algorithm ensures that in the vast majority of cases a node will
have the same address before and after a reboot. If a legacy user
configures the address explicitly, there will be no trial period and
messages, so this protocol addition is completely backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23 03:42:51 +08:00
|
|
|
u32 tipc_node_try_addr(struct net *net, u8 *id, u32 addr);
|
|
|
|
void tipc_node_check_dest(struct net *net, u32 onode, u8 *peer_id128,
|
2015-07-31 06:24:22 +08:00
|
|
|
struct tipc_bearer *bearer,
|
tipc: improve throughput between nodes in netns
Currently, TIPC transports intra-node user data messages directly
socket to socket, hence shortcutting all the lower layers of the
communication stack. This gives TIPC very good intra node performance,
both regarding throughput and latency.
We now introduce a similar mechanism for TIPC data traffic across
network namespaces located in the same kernel. On the send path, the
call chain is as always accompanied by the sending node's network name
space pointer. However, once we have reliably established that the
receiving node is represented by a namespace on the same host, we just
replace the namespace pointer with the receiving node/namespace's
ditto, and follow the regular socket receive patch though the receiving
node. This technique gives us a throughput similar to the node internal
throughput, several times larger than if we let the traffic go though
the full network stacks. As a comparison, max throughput for 64k
messages is four times larger than TCP throughput for the same type of
traffic.
To meet any security concerns, the following should be noted.
- All nodes joining a cluster are supposed to have been be certified
and authenticated by mechanisms outside TIPC. This is no different for
nodes/namespaces on the same host; they have to auto discover each
other using the attached interfaces, and establish links which are
supervised via the regular link monitoring mechanism. Hence, a kernel
local node has no other way to join a cluster than any other node, and
have to obey to policies set in the IP or device layers of the stack.
- Only when a sender has established with 100% certainty that the peer
node is located in a kernel local namespace does it choose to let user
data messages, and only those, take the crossover path to the receiving
node/namespace.
- If the receiving node/namespace is removed, its namespace pointer
is invalidated at all peer nodes, and their neighbor link monitoring
will eventually note that this node is gone.
- To ensure the "100% certainty" criteria, and prevent any possible
spoofing, received discovery messages must contain a proof that the
sender knows a common secret. We use the hash mix of the sending
node/namespace for this purpose, since it can be accessed directly by
all other namespaces in the kernel. Upon reception of a discovery
message, the receiver checks this proof against all the local
namespaces'hash_mix:es. If it finds a match, that, along with a
matching node id and cluster id, this is deemed sufficient proof that
the peer node in question is in a local namespace, and a wormhole can
be opened.
- We should also consider that TIPC is intended to be a cluster local
IPC mechanism (just like e.g. UNIX sockets) rather than a network
protocol, and hence we think it can justified to allow it to shortcut the
lower protocol layers.
Regarding traceability, we should notice that since commit 6c9081a3915d
("tipc: add loopback device tracking") it is possible to follow the node
internal packet flow by just activating tcpdump on the loopback
interface. This will be true even for this mechanism; by activating
tcpdump on the involved nodes' loopback interfaces their inter-name
space messaging can easily be tracked.
v2:
- update 'net' pointer when node left/rejoined
v3:
- grab read/write lock when using node ref obj
v4:
- clone traffics between netns to loopback
Suggested-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 08:51:21 +08:00
|
|
|
u16 capabilities, u32 signature, u32 hash_mixes,
|
2015-07-31 06:24:22 +08:00
|
|
|
struct tipc_media_addr *maddr,
|
|
|
|
bool *respond, bool *dupl_addr);
|
2015-07-31 06:24:16 +08:00
|
|
|
void tipc_node_delete_links(struct net *net, int bearer_id);
|
2018-04-19 17:06:20 +08:00
|
|
|
void tipc_node_apply_property(struct net *net, struct tipc_bearer *b, int prop);
|
2015-01-09 15:27:05 +08:00
|
|
|
int tipc_node_get_linkname(struct net *net, u32 bearer_id, u32 node,
|
|
|
|
char *linkname, size_t len);
|
2015-07-17 04:54:24 +08:00
|
|
|
int tipc_node_xmit(struct net *net, struct sk_buff_head *list, u32 dnode,
|
|
|
|
int selector);
|
2017-10-13 17:04:21 +08:00
|
|
|
int tipc_node_distr_xmit(struct net *net, struct sk_buff_head *list);
|
2015-07-17 04:54:24 +08:00
|
|
|
int tipc_node_xmit_skb(struct net *net, struct sk_buff *skb, u32 dest,
|
|
|
|
u32 selector);
|
2015-11-20 03:30:42 +08:00
|
|
|
void tipc_node_subscribe(struct net *net, struct list_head *subscr, u32 addr);
|
|
|
|
void tipc_node_unsubscribe(struct net *net, struct list_head *subscr, u32 addr);
|
|
|
|
void tipc_node_broadcast(struct net *net, struct sk_buff *skb);
|
2015-01-09 15:27:05 +08:00
|
|
|
int tipc_node_add_conn(struct net *net, u32 dnode, u32 port, u32 peer_port);
|
|
|
|
void tipc_node_remove_conn(struct net *net, u32 dnode, u32 port);
|
tipc: improve throughput between nodes in netns
Currently, TIPC transports intra-node user data messages directly
socket to socket, hence shortcutting all the lower layers of the
communication stack. This gives TIPC very good intra node performance,
both regarding throughput and latency.
We now introduce a similar mechanism for TIPC data traffic across
network namespaces located in the same kernel. On the send path, the
call chain is as always accompanied by the sending node's network name
space pointer. However, once we have reliably established that the
receiving node is represented by a namespace on the same host, we just
replace the namespace pointer with the receiving node/namespace's
ditto, and follow the regular socket receive patch though the receiving
node. This technique gives us a throughput similar to the node internal
throughput, several times larger than if we let the traffic go though
the full network stacks. As a comparison, max throughput for 64k
messages is four times larger than TCP throughput for the same type of
traffic.
To meet any security concerns, the following should be noted.
- All nodes joining a cluster are supposed to have been be certified
and authenticated by mechanisms outside TIPC. This is no different for
nodes/namespaces on the same host; they have to auto discover each
other using the attached interfaces, and establish links which are
supervised via the regular link monitoring mechanism. Hence, a kernel
local node has no other way to join a cluster than any other node, and
have to obey to policies set in the IP or device layers of the stack.
- Only when a sender has established with 100% certainty that the peer
node is located in a kernel local namespace does it choose to let user
data messages, and only those, take the crossover path to the receiving
node/namespace.
- If the receiving node/namespace is removed, its namespace pointer
is invalidated at all peer nodes, and their neighbor link monitoring
will eventually note that this node is gone.
- To ensure the "100% certainty" criteria, and prevent any possible
spoofing, received discovery messages must contain a proof that the
sender knows a common secret. We use the hash mix of the sending
node/namespace for this purpose, since it can be accessed directly by
all other namespaces in the kernel. Upon reception of a discovery
message, the receiver checks this proof against all the local
namespaces'hash_mix:es. If it finds a match, that, along with a
matching node id and cluster id, this is deemed sufficient proof that
the peer node in question is in a local namespace, and a wormhole can
be opened.
- We should also consider that TIPC is intended to be a cluster local
IPC mechanism (just like e.g. UNIX sockets) rather than a network
protocol, and hence we think it can justified to allow it to shortcut the
lower protocol layers.
Regarding traceability, we should notice that since commit 6c9081a3915d
("tipc: add loopback device tracking") it is possible to follow the node
internal packet flow by just activating tcpdump on the loopback
interface. This will be true even for this mechanism; by activating
tcpdump on the involved nodes' loopback interfaces their inter-name
space messaging can easily be tracked.
v2:
- update 'net' pointer when node left/rejoined
v3:
- grab read/write lock when using node ref obj
v4:
- clone traffics between netns to loopback
Suggested-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 08:51:21 +08:00
|
|
|
int tipc_node_get_mtu(struct net *net, u32 addr, u32 sel, bool connected);
|
2017-10-13 17:04:19 +08:00
|
|
|
bool tipc_node_is_up(struct net *net, u32 addr);
|
2016-05-02 23:58:46 +08:00
|
|
|
u16 tipc_node_get_capabilities(struct net *net, u32 addr);
|
2014-11-20 17:29:17 +08:00
|
|
|
int tipc_nl_node_dump(struct sk_buff *skb, struct netlink_callback *cb);
|
2015-11-20 03:30:46 +08:00
|
|
|
int tipc_nl_node_dump_link(struct sk_buff *skb, struct netlink_callback *cb);
|
2015-11-20 03:30:45 +08:00
|
|
|
int tipc_nl_node_reset_link_stats(struct sk_buff *skb, struct genl_info *info);
|
|
|
|
int tipc_nl_node_get_link(struct sk_buff *skb, struct genl_info *info);
|
|
|
|
int tipc_nl_node_set_link(struct sk_buff *skb, struct genl_info *info);
|
2016-08-18 16:33:52 +08:00
|
|
|
int tipc_nl_peer_rm(struct sk_buff *skb, struct genl_info *info);
|
2014-06-26 09:41:33 +08:00
|
|
|
|
2016-07-26 14:47:19 +08:00
|
|
|
int tipc_nl_node_set_monitor(struct sk_buff *skb, struct genl_info *info);
|
2016-07-26 14:47:20 +08:00
|
|
|
int tipc_nl_node_get_monitor(struct sk_buff *skb, struct genl_info *info);
|
2016-07-26 14:47:22 +08:00
|
|
|
int tipc_nl_node_dump_monitor(struct sk_buff *skb, struct netlink_callback *cb);
|
|
|
|
int tipc_nl_node_dump_monitor_peer(struct sk_buff *skb,
|
|
|
|
struct netlink_callback *cb);
|
tipc: improve throughput between nodes in netns
Currently, TIPC transports intra-node user data messages directly
socket to socket, hence shortcutting all the lower layers of the
communication stack. This gives TIPC very good intra node performance,
both regarding throughput and latency.
We now introduce a similar mechanism for TIPC data traffic across
network namespaces located in the same kernel. On the send path, the
call chain is as always accompanied by the sending node's network name
space pointer. However, once we have reliably established that the
receiving node is represented by a namespace on the same host, we just
replace the namespace pointer with the receiving node/namespace's
ditto, and follow the regular socket receive patch though the receiving
node. This technique gives us a throughput similar to the node internal
throughput, several times larger than if we let the traffic go though
the full network stacks. As a comparison, max throughput for 64k
messages is four times larger than TCP throughput for the same type of
traffic.
To meet any security concerns, the following should be noted.
- All nodes joining a cluster are supposed to have been be certified
and authenticated by mechanisms outside TIPC. This is no different for
nodes/namespaces on the same host; they have to auto discover each
other using the attached interfaces, and establish links which are
supervised via the regular link monitoring mechanism. Hence, a kernel
local node has no other way to join a cluster than any other node, and
have to obey to policies set in the IP or device layers of the stack.
- Only when a sender has established with 100% certainty that the peer
node is located in a kernel local namespace does it choose to let user
data messages, and only those, take the crossover path to the receiving
node/namespace.
- If the receiving node/namespace is removed, its namespace pointer
is invalidated at all peer nodes, and their neighbor link monitoring
will eventually note that this node is gone.
- To ensure the "100% certainty" criteria, and prevent any possible
spoofing, received discovery messages must contain a proof that the
sender knows a common secret. We use the hash mix of the sending
node/namespace for this purpose, since it can be accessed directly by
all other namespaces in the kernel. Upon reception of a discovery
message, the receiver checks this proof against all the local
namespaces'hash_mix:es. If it finds a match, that, along with a
matching node id and cluster id, this is deemed sufficient proof that
the peer node in question is in a local namespace, and a wormhole can
be opened.
- We should also consider that TIPC is intended to be a cluster local
IPC mechanism (just like e.g. UNIX sockets) rather than a network
protocol, and hence we think it can justified to allow it to shortcut the
lower protocol layers.
Regarding traceability, we should notice that since commit 6c9081a3915d
("tipc: add loopback device tracking") it is possible to follow the node
internal packet flow by just activating tcpdump on the loopback
interface. This will be true even for this mechanism; by activating
tcpdump on the involved nodes' loopback interfaces their inter-name
space messaging can easily be tracked.
v2:
- update 'net' pointer when node left/rejoined
v3:
- grab read/write lock when using node ref obj
v4:
- clone traffics between netns to loopback
Suggested-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 08:51:21 +08:00
|
|
|
void tipc_node_pre_cleanup_net(struct net *exit_net);
|
2006-01-03 02:04:38 +08:00
|
|
|
#endif
|