TIPC has two internal servers, one providing a subscription
service for topology events, and another providing the
configuration interface. These servers have previously been running
in BH context, accessing the TIPC-port (aka native) API directly.
Apart from these servers, even the TIPC socket implementation is
partially built on this API.
As this API may simultaneously be called via different paths and in
different contexts, a complex and costly lock policiy is required
in order to protect TIPC internal resources.
To eliminate the need for this complex lock policiy, we introduce
a new, generic service API that uses kernel sockets for message
passing instead of the native API. Once the toplogy and configuration
servers are converted to use this new service, all code pertaining
to the native API can be removed. This entails a significant
reduction in code amount and complexity, and opens up for a complete
rework of the locking policy in TIPC.
The new service also solves another problem:
As the current topology server works in BH context, it cannot easily
be blocked when sending of events fails due to congestion. In such
cases events may have to be silently dropped, something that is
unacceptable. Therefore, the new service keeps a dedicated outbound
queue receiving messages from BH context. Once messages are
inserted into this queue, we will immediately schedule a work from a
special workqueue. This way, messages/events from the topology server
are in reality sent in process context, and the server can block
if necessary.
Analogously, there is a new workqueue for receiving messages. Once a
notification about an arriving message is received in BH context, we
schedule a work from the receive workqueue to do the job of
receiving the message in process context.
As both sending and receive messages are now finished in processes,
subscribed events cannot be dropped any more.
As of this commit, this new server infrastructure is built, but
not actually yet called by the existing TIPC code, but since the
conversion changes required in order to use it are significant,
the addition is kept here as a separate commit.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TIPC's implied connect feature, aka piggyback connect, allows
applications to save one syscall and all SYN/SYN-ACK signalling
overhead when setting up a connection. Until now, this has only
been supported for SEQPACKET sockets. Here, we make it possible
to use this feature even with stream sockets.
At the connecting side, the connection is completed when the
first data message arrives from the accepting peer. This means
that we must allow the connecting user to call blocking recv()
before the socket has reached state SS_CONNECTED. So we must must
relax the state machine check at recv_stream(), and allow the
recv() call even if socket is in state SS_CONNECTING.
Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As per feedback from the netdev community, we change the buffer
overflow protection algorithm in receiving sockets so that it
always respects the nominal upper limit set in sk_rcvbuf.
Instead of scaling up from a small sk_rcvbuf value, which leads to
violation of the configured sk_rcvbuf limit, we now calculate the
weighted per-message limit by scaling down from a much bigger value,
still in the same field, according to the importance priority of the
received message.
To allow for administrative tunability of the socket receive buffer
size, we create a tipc_rmem sysctl variable to allow the user to
configure an even bigger value via sysctl command. It is a size of
three (min/default/max) to be consistent with things like tcp_rmem.
By default, the value initialized in tipc_rmem[1] is equal to the
receive socket size needed by a TIPC_CRITICAL_IMPORTANCE message.
This value is also set as the default value of sk_rcvbuf.
Originally-by: Jon Maloy <jon.maloy@ericsson.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
[Ying: added sysctl variation to Jon's original patch]
Signed-off-by: Ying Xue <ying.xue@windriver.com>
[PG: don't compile sysctl.c if not config'd; add Documentation]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Files tipc.h and tipc_config.h were moved to uapi directory, but
the corresponding comments were not updated at the same time.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
adds a socket option for low latency polling.
This allows overriding the global sysctl value with a per-socket one.
Unexport sysctl_net_ll_poll since for now it's not needed in modules.
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove NET_LL_RX_POLL from the config menu.
Change default to y.
Busy polling still needs to be enabled at run time.
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use sched_clock() instead of get_cycles().
We can use sched_clock() because we don't care much about accuracy.
Remove the dependency on X86_TSC
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no reason for sysctl_net_ll_poll to be an unsigned long.
Change it into an unsigned int.
Fix the proc handler.
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case we need to bail out for whatever reason during assoc
init, we call sctp_endpoint_put() and then sock_put(), however,
we've hold both refs in reverse, non-symmetric order, so first
sctp_endpoint_hold() and then sock_hold().
Reverse this, so that in an error case we have sock_put() and then
sctp_endpoint_put(). Actually shouldn't matter too much, since both
cleanup paths do the right thing, but that way, it is more consistent
with the rest of the code.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's only used at this one time, so we could remove it as well.
This is valid and also makes it more explicit/obvious that in case
of error the sp->ep is NULL here, i.e. for the sctp_destroy_sock()
check that was recently added.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While this currently cannot trigger any NULL pointer dereference in
sctp_seq_dump_local_addrs(), better change the order of commands to
prevent a future bug to happen. Although we first add SCTP_CMD_NEW_ASOC
and then set the SCTP_CMD_INIT_CHOOSE_TRANSPORT, it is okay for now,
since this primitive is only called by sctp_connect() or sctp_sendmsg()
with sctp_assoc_add_peer() set first. However, lets do this precaution
and first set the transport and then add it to the association hashlist
to prevent in future something to possibly triggering this.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This clearly states a BUG somewhere in the SCTP code as e.g. fixed once
in f28156335 ("sctp: Use correct sideffect command in duplicate cookie
handling"). If this ever happens, throw a trace in the sideeffect engine
where assocs clearly must have a primary_path assigned.
When in sctp_seq_dump_local_addrs() also throw a WARN and bail out since
we do not need to panic for printing this one asterisk. Also, it will
avoid the not so obvious case when primary != NULL test passes and at a
later point in time triggering a NULL ptr dereference caused by primary.
While at it, also fix up the white space.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jesse Gross says:
====================
A few miscellaneous improvements and cleanups before the GRE tunnel
integration series. Intended for net-next/3.11.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This is not functional change, this is just code cleanup.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Following patch changes vport->send return type so that vport
layer can do error accounting.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
The get_config vport op is left over from old compatibility code,
it is neither used nor implemented any more.
Signed-off-by: Jesse Gross <jesse@nicira.com>
It is an error to try to change the type of a vport using the set
command. However, while we check that this is an error, we still
proceed to allocate memory which then gets freed immediately.
This stops processing after noticing the error, which does not
actually fix a bug but is more correct.
Signed-off-by: Jesse Gross <jesse@nicira.com>
Add support to change the link state of VF (vPort)
Signed-off-by: Rony Efraim <ronye@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add netlink directives and ndo entry to allow for controling
VF link, which can be in one of three states:
Auto - VF link state reflects the PF link state (default)
Up - VF link state is up, traffic from VF to VF works even if
the actual PF link is down
Down - VF link state is down, no traffic from/to this VF, can be of
use while configuring the VF
Signed-off-by: Rony Efraim <ronye@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for the Broadcom BCM6345 SoC Ethernet. BCM6345
has a slightly different and older DMA engine which requires the
following modifications:
- the width of the DMA channels on BCM6345 is 64 bytes vs 16 bytes,
which means that the helpers enet_dma{c,s} need to account for this
channel width and we can no longer use macros
- BCM6345 DMA engine does not have any internal SRAM for transfering
buffers
- BCM6345 buffer allocation and flow control is not per-channel but
global (done in RSET_ENETDMA)
- the DMA engine bits are right-shifted by 3 compared to other DMA
generations
- the DMA enable/interrupt masks are a little different (we need to
enabled more bits for 6345)
- some register have the same meaning but are offsetted in the ENET_DMAC
space so a lookup table is required to return the proper offset
The MAC itself is identical and requires no modifications to work.
Signed-off-by: Florian Fainelli <florian@openwrt.org>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
htb_class structures are big, and source of false sharing on SMP.
By carefully splitting them in two parts, we can improve performance.
I got 9 % performance increase on a 24 threads machine, with 200
concurrent netperf in TCP_RR mode, using a HTB hierarchy of 4 classes.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Caught by sparse:
- __rcu: missing annotation to sd->flow_limit
- __user: direct access in cpumask_scnprintf
Also
- add endline character when printing bitmap if room in buffer
- avoid bucket overflow by reducing FLOW_LIMIT_HISTORY
The last item warrants some explanation. The hashtable buckets are
subject to overflow if FLOW_LIMIT_HISTORY is larger than or equal
to bucket size, since all packets may end up in a single bucket. The
current (rather arbitrary) history value of 256 happens to match the
buffer size (u8).
As a result, with a single flow, the first 128 packets are accepted
(correct), the second 128 packets dropped (correct) and then the
history[] array has filled, so that each subsequent new packet
causes an increment in the bucket for new_flow plus a decrement
for old_flow: a steady state.
This is fine if packets are dropped, as the steady state goes away
as soon as a mix of traffic reappears. But, because the 256th packet
overflowed the bucket to 0: no packets are dropped.
Instead of explicitly adding an overflow check, this patch changes
FLOW_LIMIT_HISTORY to never be able to overflow a single bucket.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
(first item)
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linux sends new unset data during disorder and recovery state if all
(suspected) lost packets have been retransmitted ( RFC5681, section
3.2 step 1 & 2, RFC3517 section 4, NexSeg() Rule 2). One requirement
is to keep the receive window about twice the estimated sender's
congestion window (tcp_rcv_space_adjust()), assuming the fast
retransmits repair the losses in the next round trip.
But currently it's not the case on the first round trip in either
normal or Fast Open connection, beucase the initial receive window
is identical to (expected) sender's initial congestion window. The
fix is to double it.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that the SoC specific support is no longer done with help of #ifdef'fery,
we no longer need '__maybe_unused' annotations to sh_eth_select_mii() and
sh_eth_set_duplex()...
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reduce the uses of this unnecessary typedef.
Done via perl script:
$ git grep --name-only -w ctl_table net | \
xargs perl -p -i -e '\
sub trim { my ($local) = @_; $local =~ s/(^\s+|\s+$)//g; return $local; } \
s/\b(?<!struct\s)ctl_table\b(\s*\*\s*|\s+\w+)/"struct ctl_table " . trim($1)/ge'
Reflow the modified lines that now exceed 80 columns.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since team functionality relies heavily on userspace daemon, we need to
deliver event to userspace via Netlink as quick as possible. So make all
team port device link events urgent.
Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/ipv4/ping.c:286:5: sparse: symbol 'ping_check_bind_addr' was not declared. Should it be static?
net/ipv4/ping.c:355:6: sparse: symbol 'ping_set_saddr' was not declared. Should it be static?
net/ipv4/ping.c:370:6: sparse: symbol 'ping_clear_saddr' was not declared. Should it be static?
net/ipv6/ping.c:60:5: sparse: symbol 'dummy_ipv6_recv_error' was not declared. Should it be static?
net/ipv6/ping.c:64:5: sparse: symbol 'dummy_ip6_datagram_recv_ctl' was not declared. Should it be static?
net/ipv6/ping.c:69:5: sparse: symbol 'dummy_icmpv6_err_convert' was not declared. Should it be static?
net/ipv6/ping.c:73:6: sparse: symbol 'dummy_ipv6_icmp_error' was not declared. Should it be static?
net/ipv6/ping.c:75:5: sparse: symbol 'dummy_ipv6_chk_addr' was not declared. Should it be static?
net/ipv6/ping.c:201:5: sparse: symbol 'ping_v6_seq_show' was not declared. Should it be static?
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net_device::dev_id should not be used merely to indicate a VI index,
as it affects the way the local part of IPv6 addresses is normally
generated.
This field was intended for use where multiple devices may share a
single assigned MAC address and need to have different IPv6 addresses.
T4 VIs each have their own MAC address.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Dimitris Michailidis <dm@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Return -EINVAL on illegal flag instead of uninitialized value. This fixes the
kbuild test warning.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Corrects an byte order conflict introduced by
158874cac6
("sctp: Correct access to skb->{network, transport}_header").
The values in question are host byte order.
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit da12c90e09
"netlink: Add compare function for netlink_table"
only set compare at the time we create kernel netlink,
and reset compare to NULL at the time we finially
release netlink socket, but netlink_lookup wants
the compare exist always.
So we should set compare after we allocate nl_table,
and never reset it. make comapre exist all the time.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 6648bd7e0e (ipv4: Add sysctl knob to control
early socket demux) introduced such sysctl, but forgot to add
doc into Documentation/networking/ip-sysctl.txt. This patch adds it.
Basically I grab the doc from the description of commit 41063e9dd1
(ipv4: Early TCP socket demux.) and the above commit.
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This routine doesn't fail since 9fdc6bef (tuntap: dont use a private kmem_cache)
so it makes sense to compact the code a little bit.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The TUN_PERSIST flag is not reported at all -- both TUNGETIFF, and sysfs
"flags" attribute skip one. Knowing whether a device is persistent or not
is critical for checkpoint-restore, thus I propose to add the read-only
IFF_PERSIST one for this.
Setting this new IFF_PERSIST is hardly possible, as TUNSETIFF doesn't check
for unknown flags being zero and thus there can be trash.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit ba418fa357 ("soreuseport: UDP/IPv4 implementation")
added following sparse errors :
net/ipv4/udp.c:433:60: warning: cast from restricted __be16
net/ipv4/udp.c:433:60: warning: incorrect type in argument 1 (different base types)
net/ipv4/udp.c:433:60: expected unsigned short [unsigned] [usertype] val
net/ipv4/udp.c:433:60: got restricted __be16 [usertype] sport
net/ipv4/udp.c:433:60: warning: cast from restricted __be16
net/ipv4/udp.c:433:60: warning: cast from restricted __be16
net/ipv4/udp.c:514:60: warning: cast from restricted __be16
net/ipv4/udp.c:514:60: warning: incorrect type in argument 1 (different base types)
net/ipv4/udp.c:514:60: expected unsigned short [unsigned] [usertype] val
net/ipv4/udp.c:514:60: got restricted __be16 [usertype] sport
net/ipv4/udp.c:514:60: warning: cast from restricted __be16
net/ipv4/udp.c:514:60: warning: cast from restricted __be16
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix following sparse error :
net/ipv4/af_inet.c:1410:59: warning: restricted __be16 degrades to
integer
added in commit db8caf3dbc
("gro: should aggregate frames without DF")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
From: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John W. Linville says:
====================
This pull request is intended for the 3.11 stream...
One big highlight is the cw1200 driver the ST-E CW1100 & CW1200
WLAN chipsets. This one has been lingering for a while, lacking
some review comments. Once started getting pulled into linux-next,
it got a bit more attention and a number of improvements were made
over the initial cut. No doubt there will be more changes ahead,
but I think it is looking alright at this point.
Along with that, there is the usual flurry of updates to the mac80211
core and the iwlwifi, mwifiex, ath9k, rt2x00, wil6210, and other
drivers. A few of the highlights are some rt2x00 refactoring/cleanup
by Gabor Juhos, some rt2800 hardware support enhancements by Stanislaw
Gruszka, some iwlwifi power management updates from Alexander Bondar,
some enhanced bcma SPROM support from Rafał Miłecki, and a variety
of other things here and there.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 9f86134155 (sh_eth: remove SH_ETH_HAS_TSU)
removes 'const' from 'sh_eth_netdev_ops' and modifies it in case TSU registers
are present. I've originally suggested to Iwamatsu-san to split this structure
in two instead and afterwards Dave M. suggested doing the same.
Split 'sh_eth_netdev_ops_tsu' from 'sh_eth_netdev_ops', making both 'const', and
assigning 'ndev->detdev_ops' depending on the presence of TSU registers.
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix following sparse errors :
net/ipv4/igmp.c:1222:25: warning: cast from restricted __be32
net/ipv4/igmp.c🔢31: warning: incorrect type in assignment (different address spaces)
net/ipv4/igmp.c🔢31: expected struct ip_mc_list [noderef] <asn:4>*next_hash
net/ipv4/igmp.c🔢31: got struct ip_mc_list *<noident>
net/ipv4/igmp.c:1250:31: warning: incorrect type in assignment (different address spaces)
net/ipv4/igmp.c:1250:31: expected struct ip_mc_list [noderef] <asn:4>*next_hash
net/ipv4/igmp.c:1250:31: got struct ip_mc_list *<noident>
net/ipv4/igmp.c:2380:37: warning: cast from restricted __be32
These were added by commit e989707135
("igmp: hash a hash table to speedup ip_check_mc_rcu()")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Older Single Queue (SQ_SG_MODE) devices like TSEC (i.e. mpc83xx)
don't feature the frame receive indication bits (RXF) in RSTAT.
For these and for the rest of the SQ_SG_MODE devices, provide the
appropiate polling routine that handles a single pair of Rx/Tx
BD rings, removing the overhead incurred by the multiple queues/
multiple interrupt group devices (veTSEC/ eTSEC2.0 devices).
So this is primarily a fix for the TSEC devices.
Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should not use net_device::dev_id to indicate the port number, as
this affects the way the local part of IPv6 addresses is normally
generated.
This field was intended for use where multiple devices may share a
single assigned MAC address and need to have different IPv6 addresses.
Siena's two ports each have their own MAC addresses.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit ef722495c8
( [IPV4]: Remove unused ip_options->is_data) removed the unused is_data
member from ip_options struct.
This patch removes is_data also from the documentation of the ip_options struct.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Check the unlikely case of team->en_port_count == 0 before modulo
operation.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes synchronize_rcu() from function
__team_queue_override_port_del(). That can be done because it is ok to
do list_del_rcu() and list_add_tail_rcu() on the same list_head member
without calling synchronize_rcu() in between. A bit of refactoring
needed to be done because INIT_LIST_HEAD needed to be removed (to not
kill the forward pointer) as well.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>