Binding to a v4-mapped address on an AF_INET6 socket should
produce the same result as binding to an IPv4 address on
AF_INET socket. The two are interchangable as v4-mapped
address is really a portability aid.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IPv4 wildcard (0.0.0.0) address does not intersect
in any way with explicit IPv6 addresses. These two should
be permitted, but the IPv4 conflict code checks the ipv6only
bit as part of the test. Since binding to an explicit IPv6
address restricts the socket to only that IPv6 address, the
side-effect is that the socket behaves as v6-only. By
explicitely setting ipv6only in this case, allows the 2 binds
to succeed.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A socket marked v6-only, can not receive or send traffic to v4-mapped
addresses. Thus allowing binding to v4-mapped address on such a
socket makes no sense.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_sack_swap seems unnecessary so I pushed swap to the caller.
Also removed comment that seemed then pointless, and added include
when not already there. Compile tested.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
dev can be NULL in ip[6]_frag_reasm for skb's coming from RAW sockets.
Quagga's OSPFD sends fragmented packets on a RAW socket, when netfilter
conntrack reassembles them on the OUTPUT path you hit this code path.
You can test it with something like "hping2 -0 -d 2000 -f AA.BB.CC.DD"
With help from Jarek Poplawski.
Signed-off-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes the regressions cause by
commit 1326c3d5a4
(v2.6.28-rc6-461-g23a12b1) broke the display of local and remote
addresses of an SIT tunnel in iproute2.
nt->parms is used by ipip6_tunnel_init() and therefore need to be
initialized first.
Tracked as http://bugzilla.kernel.org/show_bug.cgi?id=12868
Reported-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the behavior of allowing both sysctl and addrconf_dad_failure()
to set the disable_ipv6 parameter without any bad side-effects.
If DAD fails and accept_dad > 1, we will still set disable_ipv6=1,
but then instead of allowing an RA to add an address then
immediately fail DAD, we simply don't allow the address to be
added in the first place. This also lets the user set this flag
and disable all IPv6 addresses on the interface, or on the entire
system.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NEXTHDR_NONE doesn't has an IPv6 option header, so the first check
for the length will always fail and results in a confusing message
"too short" if debugging enabled. With this patch, we check for
NEXTHDR_NONE before length sanity checkings are done.
Signed-off-by: Christoph Paasch <christoph.paasch@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The ip6_queue module is missing the net-pf-16-proto-13 alias that would
cause it to be auto-loaded when a socket of that type is opened. This
patch adds the alias.
Signed-off-by: Scott James Remnant <scott@canonical.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch modifies nf_log to use a linked list of loggers for each
protocol. This list of loggers is read and write protected with a
mutex.
This patch separates registration and binding. To be used as
logging module, a module has to register calling nf_log_register()
and to bind to a protocol it has to call nf_log_bind_pf().
This patch also converts the logging modules to the new API. For nfnetlink_log,
it simply switchs call to register functions to call to bind function and
adds a call to nf_log_register() during init. For other modules, it just
remove a const flag from the logger structure and replace it with a
__read_mostly.
Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Do not try to "uninitialize" ipv6 if its initialization had been skipped
because module parameter disable=1 had been specified.
Reported-by: Thomas Backlund <tmb@mandriva.org>
Signed-off-by: John Dykstra <john.dykstra1@gmail.com>
Acked-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Protocols that use packet_type can be __read_mostly section for better
locality. Elminate any unnecessary initializations of NULL.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add "disable" module parameter support to ipv6.ko by specifying
"disable=1" on module load. We just do the minimum of initializing
inetsw6[] so calls from other modules to inet6_register_protosw()
won't OOPs, then bail out. No IPv6 addresses or sockets can be
created as a result, and a reboot is required to enable IPv6.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a network namespace is destroyed the network interfaces are
all unregistered, making addrconf_ifdown called by the netdevice
notifier.
In the other hand, the addrconf exit method does a loop on the network
devices and does addrconf_ifdown on each of them. But the ordering of
the netns subsystem is not right because it uses the register_pernet_device
instead of register_pernet_subsys. If we handle the loopback as
any network device, we can safely use register_pernet_subsys.
But if we use register_pernet_subsys, the addrconf exit method will do
exactly what was already done with the unregistering of the network
devices. So in definitive, this code is pointless.
I removed the netns addrconf exit method and moved the code to the
addrconf cleanup function.
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We already have a valid net in that place, but this is not just a
cleanup - the tw pointer can be NULL there sometimes, thus causing
an oops in NET_NS=y case.
The same place in ipv4 code already works correctly using existing
net, rather than tw's one.
The bug exists since 2.6.27.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The functions time_before is more robust for comparing
jiffies against other values.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove some pointless conditionals before kfree_skb().
The semantic match that finds the problem is as follows:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@@
expression E;
@@
- if (E)
- kfree_skb(E);
+ kfree_skb(E);
// </smpl>
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch changes the return value of nlmsg_notify() as follows:
If NETLINK_BROADCAST_ERROR is set by any of the listeners and
an error in the delivery happened, return the broadcast error;
else if there are no listeners apart from the socket that
requested a change with the echo flag, return the result of the
unicast notification. Thus, with this patch, the unicast
notification is handled in the same way of a broadcast listener
that has set the NETLINK_BROADCAST_ERROR socket flag.
This patch is useful in case that the caller of nlmsg_notify()
wants to know the result of the delivery of a netlink notification
(including the broadcast delivery) and take any action in case
that the delivery failed. For example, ctnetlink can drop packets
if the event delivery failed to provide reliable logging and
state-synchronization at the cost of dropping packets.
This patch also modifies the rtnetlink code to ignore the return
value of rtnl_notify() in all callers. The function rtnl_notify()
(before this patch) returned the error of the unicast notification
which makes rtnl_set_sk_err() reports errors to all listeners. This
is not of any help since the origin of the change (the socket that
requested the echoing) notices the ENOBUFS error if the notification
fails and should resync itself.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix this sparse warning:
net/ipv6/xfrm6_state.c:72:26: warning: Using plain integer as NULL pointer
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The reader/writer lock in ip_tables is acquired in the critical path of
processing packets and is one of the reasons just loading iptables can cause
a 20% performance loss. The rwlock serves two functions:
1) it prevents changes to table state (xt_replace) while table is in use.
This is now handled by doing rcu on the xt_table. When table is
replaced, the new table(s) are put in and the old one table(s) are freed
after RCU period.
2) it provides synchronization when accesing the counter values.
This is now handled by swapping in new table_info entries for each cpu
then summing the old values, and putting the result back onto one
cpu. On a busy system it may cause sampling to occur at different
times on each cpu, but no packet/byte counts are lost in the process.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Sucessfully tested on my dual quad core machine too, but iptables only (no ipv6 here)
BTW, my new "tbench 8" result is 2450 MB/s, (it was 2150 MB/s not so long ago)
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
ip6_tables netfilter module can use an ifname_compare() helper
so that two loops are unfolded.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Concern has been expressed about the changing Kconfig options.
Provide the old options that forward-select.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Suggested by: James King <t.james.king@gmail.com>
Similarly to commit c9fd496809, merge
TTL and HL. Since HL does not depend on any IPv6-specific function,
no new module dependencies would arise.
With slight adjustments to the Kconfig help text.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch adds a logging message for invalid new icmpv6 packet.
Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Later patches change the locking on xt_table and the initialization of
the lock element is not needed since the lock is always initialized in
xt_table_register anyway.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch fixes a trivial typo that was adding a new line at end of
the nf_log_packet() prefix. It also make the logging conditionnal by
adding a LOG_INVALID test.
Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
When the user creates IPv6 over IPv6 tunnel, the device name created
by the kernel isn't set to t->parm.name, which is referred as the
result of ioctl().
Signed-off-by: Noriaki TAKAMIYA <takamiya@po.ntts.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes connection tracking handling for ICMPv6 messages
related to Stateless Address Autoconfiguration, MLD, and MLDv2. They
can not be tracked because they are massively using multicast (on
pre-defined address). But they are not invalid and should not be
detected as such.
Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch fixes a typo in the inverse mapping of Node Information
request. Following draft-ietf-ipngwg-icmp-name-lookups-09, "Querier"
sends a type 139 (ICMPV6_NI_QUERY) packet to "Responder" which answer
with a type 140 (ICMPV6_NI_REPLY) packet.
Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Just like PKTINFO, limit the options area to 64K.
Based upon report by Eric Sesterhenn and analysis by
Roland Dreier.
Signed-off-by: David S. Miller <davem@davemloft.net>
As the options passed to ip6_append_data may be ephemeral, we need
to duplicate it for corking. This patch applies the simplest fix
which is to memdup all the relevant bits.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Base versions handle constant folding now.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/ipv6/ip6mr.c: In function 'pim6_rcv':
net/ipv6/ip6mr.c:368: error: implicit declaration of function 'csum_ipv6_magic'
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unfortunately simplicity isn't always the best. The fraginfo
interface turned out to be suboptimal. The problem was quite
obvious. For every packet, we have to copy the headers from
the frags structure into skb->head, even though for 99% of the
packets this part is immediately thrown away after the merge.
LRO didn't have this problem because it directly read the headers
from the frags structure.
This patch attempts to address this by creating an interface
that allows GRO to access the headers in the first frag without
having to copy it. Because all drivers that use frags place the
headers in the first frag this optimisation should be enough.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The kernel manages this value internally, as necessary, as
VIFs are added/removed and as multicast routers are registered
and deregistered.
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch addresses the IPv6 multicast routing issues described
below. It was tested with XORP 1.4/1.5 as the IPv6 PIM-SM routing
daemon against FreeBSD peers.
net/ipv6/ip6_input.c:
- Don't try to forward link-local multicast packets.
- Don't reset skb2->dev before calling ip6_mr_input() so packets can
be identified as coming from the PIM register vif properly.
net/ipv6/ip6mr.c:
- Fix incoming PIM register messages processing:
* The IPv6 pseudo-header should be included when checksumming PIM
messages (RFC 4601 section 4.9; RFC 3973 section 4.7.1).
* Packets decapsulated from PIM register messages should have
skb->protocol ETH_P_IPV6.
- Enable/disable IPv6 multicast forwarding on the corresponding
interface when a routing daemon adds/removes a multicast virtual
interface.
- Remove incorrect skb_pull() to fix userspace signaling.
- Enable/disable global IPv6 multicast forwarding when an IPv6
multicast routing socket is opened/closed.
net/ipv6/route.c:
- Don't use strict routing logic for packets decapsulated from PIM
register messages (similar to disabling rp_filter for the IPv4
case).
Signed-off-by: Thomas Goff <thomas.goff@boeing.com>
Reviewed-by: Fred Templin <fred.l.templin@boeing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the xfrm reverse flow lookup for icmp6 so that icmp6 packets
don't get lost over ipsec tunnels. Similar patch is in RHEL5 kernel for a quite
long time and I do not see why it isn't in mainline.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to perform skb_postpull_rcsum after pulling the IPv6
header in order to maintain the correctness of the complete
checksum.
This patch also adds a missing iph reload after pulling.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a fib6 table dump is prematurely ended, we won't unlink
its walker from the list. This causes all sorts of grief for
other users of the list later.
Reported-by: Chris Caputo <ccaputo@alt.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
An old bug crept back into the ICMP/ICMPv6 conntrack protocols: the timeout
values are defined as unsigned longs, the sysctl's maxsize is set to
sizeof(unsigned int). Use unsigned int for the timeout values as in the
other conntrack protocols.
Reported-by: Jean-Mickael Guerin <jean-mickael.guerin@6wind.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx: (22 commits)
ioat: fix self test for multi-channel case
dmaengine: bump initcall level to arch_initcall
dmaengine: advertise all channels on a device to dma_filter_fn
dmaengine: use idr for registering dma device numbers
dmaengine: add a release for dma class devices and dependent infrastructure
ioat: do not perform removal actions at shutdown
iop-adma: enable module removal
iop-adma: kill debug BUG_ON
iop-adma: let devm do its job, don't duplicate free
dmaengine: kill enum dma_state_client
dmaengine: remove 'bigref' infrastructure
dmaengine: kill struct dma_client and supporting infrastructure
dmaengine: replace dma_async_client_register with dmaengine_get
atmel-mci: convert to dma_request_channel and down-level dma_slave
dmatest: convert to dma_request_channel
dmaengine: introduce dma_request_channel and private channels
net_dma: convert to dma_find_channel
dmaengine: provide a common 'issue_pending_all' implementation
dmaengine: centralize channel allocation, introduce dma_find_channel
dmaengine: up-level reference counting to the module level
...
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (84 commits)
wimax: fix kernel-doc for debufs_dentry member of struct wimax_dev
net: convert pegasus driver to net_device_ops
bnx2x: Prevent eeprom set when driver is down
net: switch kaweth driver to netdevops
pcnet32: round off carrier watch timer
i2400m/usb: wrap USB power saving in #ifdef CONFIG_PM
wimax: testing for rfkill support should also test for CONFIG_RFKILL_MODULE
wimax: fix kconfig interactions with rfkill and input layers
wimax: fix '#ifndef CONFIG_BUG' layout to avoid warning
r6040: bump release number to 0.20
r6040: warn about MAC address being unset
r6040: check PHY status when bringing interface up
r6040: make printks consistent with DRV_NAME
gianfar: Fixup use of BUS_ID_SIZE
mlx4_en: Returning real Max in get_ringparam
mlx4_en: Consider inline packets on completion
netdev: bfin_mac: enable bfin_mac net dev driver for BF51x
qeth: convert to net_device_ops
vlan: add neigh_setup
dm9601: warn on invalid mac address
...
This patch adds GRO support for TCP over IPv6. The code is exactly
the same as the IPv4 version except for the pseudo-header checksum
computation.
Note that I've removed the unused tcphdr argument from tcp_v6_check
rather than invent a bogus value for GRO.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds GRO support for IPv6. IPv6 GRO supports extension
headers in the same way as GSO (by using the same infrastructure).
It's also simpler compared to IPv4 since we no longer have to worry
about fragmentation attributes or header checksums.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the general-purpose channel allocation provided by dmaengine.
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Reported-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thanks to excellent diagnosis by Eduard Guzovsky.
The core problem is that on a network with lots of active
multicast traffic, the neighbour cache can fill up. If
we try to allocate a new route and thus neighbour cache
entry, the bog-standard GC attempt the neighbour layer does
in ineffective because route entries hold a reference
to the existing neighbour entries and GC can only liberate
entries with no references.
IPV4 already has a way to handle this, by doing a route cache
GC in such situations (when neigh attach returns -ENOBUFS).
So simply mimick this on the ipv6 side.
Tested-by: Eduard Guzovsky <eguzovsky@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we converted the protocol atomic counters such as the orphan
count and the total socket count deadlocks were introduced due to
the mismatch in BH status of the spots that used the percpu counter
operations.
Based on the diagnosis and patch by Peter Zijlstra, this patch
fixes these issues by disabling BH where we may be in process
context.
Reported-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1429 commits)
net: Allow dependancies of FDDI & Tokenring to be modular.
igb: Fix build warning when DCA is disabled.
net: Fix warning fallout from recent NAPI interface changes.
gro: Fix potential use after free
sfc: If AN is enabled, always read speed/duplex from the AN advertising bits
sfc: When disabling the NIC, close the device rather than unregistering it
sfc: SFT9001: Add cable diagnostics
sfc: Add support for multiple PHY self-tests
sfc: Merge top-level functions for self-tests
sfc: Clean up PHY mode management in loopback self-test
sfc: Fix unreliable link detection in some loopback modes
sfc: Generate unique names for per-NIC workqueues
802.3ad: use standard ethhdr instead of ad_header
802.3ad: generalize out mac address initializer
802.3ad: initialize ports LACPDU from const initializer
802.3ad: remove typedef around ad_system
802.3ad: turn ports is_individual into a bool
802.3ad: turn ports is_enabled into a bool
802.3ad: make ntt bool
ixgbe: Fix set_ringparam in ixgbe to use the same memory pools.
...
Fixed trivial IPv4/6 address printing conflicts in fs/cifs/connect.c due
to the conversion to %pI (in this networking merge) and the addition of
doing IPv6 addresses (from the earlier merge of CIFS).
1.When no interface is specified in an IPV6_PKTINFO ancillary data
item, the interface specified in an IPV6_PKTINFO sticky optionis
is used.
RFC3542:
6.7. Summary of Outgoing Interface Selection
This document and [RFC-3493] specify various methods that affect the
selection of the packet's outgoing interface. This subsection
summarizes the ordering among those in order to ensure deterministic
behavior.
For a given outgoing packet on a given socket, the outgoing interface
is determined in the following order:
1. if an interface is specified in an IPV6_PKTINFO ancillary data
item, the interface is used.
2. otherwise, if an interface is specified in an IPV6_PKTINFO sticky
option, the interface is used.
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When get receiving interface index while no message is received,
the the value seted with setsockopt() should be returned.
RFC 3542:
Issuing getsockopt() for the above options will return the sticky
option value i.e., the value set with setsockopt(). If no sticky
option value has been set getsockopt() will return the following
values:
- For the IPV6_PKTINFO option, it will return an in6_pktinfo
structure with ipi6_addr being in6addr_any and ipi6_ifindex being
zero.
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are three reasons for me to add this support:
1.When no interface is specified in an IPV6_PKTINFO ancillary data
item, the interface specified in an IPV6_PKTINFO sticky optionis
is used.
RFC3542:
6.7. Summary of Outgoing Interface Selection
This document and [RFC-3493] specify various methods that affect the
selection of the packet's outgoing interface. This subsection
summarizes the ordering among those in order to ensure deterministic
behavior.
For a given outgoing packet on a given socket, the outgoing interface
is determined in the following order:
1. if an interface is specified in an IPV6_PKTINFO ancillary data
item, the interface is used.
2. otherwise, if an interface is specified in an IPV6_PKTINFO sticky
option, the interface is used.
2.When no IPV6_PKTINFO ancillary data is received,getsockopt() should
return the sticky option value which set with setsockopt().
RFC 3542:
Issuing getsockopt() for the above options will return the sticky
option value i.e., the value set with setsockopt(). If no sticky
option value has been set getsockopt() will return the following
values:
3.Make the setsockopt implementation POSIX compliant.
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This last patch makes the appropriate changes to use and propagate the
network namespace where needed in IPv6 multicast forwarding code.
This consists mainly in replacing all the remaining init_net occurences
with current netns pointer retrieved from sockets, net devices or
mfc6_caches depending on the routines' contexts.
Some routines receive a new 'struct net' parameter to propagate the current
netns:
* ip6mr_get_route
* ip6mr_cache_report
* ip6mr_cache_find
* ip6mr_cache_unresolved
* mif6_add/mif6_delete
* ip6mr_mfc_add/ip6mr_mfc_delete
* ip6mr_reg_vif
All the IPv6 multicast forwarding variables moved to struct netns_ipv6 by
the previous patches are now referenced in the correct namespace.
Changelog:
==========
* Take into account the net associated to mfc6_cache when matching entries in
mfc_unres_queue list.
* Call mroute_clean_tables() in ip6mr_net_exit() to free memory allocated
per-namespace.
* Call dev_net_set() in ip6mr_reg_vif() to initialize dev->nd_net
correctly.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Declare IPv6 multicast forwarding /proc/net entries per-namespace:
/proc/net/ip6_mr_vif
/proc/net/ip6_mr_cache
Changelog
=========
V2:
* In routine ipmr_mfc_seq_idx(), only match entries belonging to current
netns in mfc_unres_queue list.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Preliminary work to make IPv6 multicast forwarding netns-aware.
Declare variable 'reg_vif_num' per-namespace, moves into struct netns_ipv6.
At the moment, this variable is only referenced in init_net.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Preliminary work to make IPv6 multicast forwarding netns-aware.
Declare IPv6 multicast forwarding variables 'mroute_do_assert' and
'mroute_do_pim' per-namespace in struct netns_ipv6.
At the moment, these variables are only referenced in init_net.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Preliminary work to make IPv6 multicast forwarding netns-aware.
Declare variable cache_resolve_queue_len per-namespace: moves it into
struct netns_ipv6.
This variable counts the number of unresolved cache entries queued in the
list mfc_unres_queue. This list is kept global to all netns as the number
of entries per namespace is limited to 10 (hardcoded in routine
ip6mr_cache_unresolved).
Entries belonging to different namespaces in mfc_unres_queue will be
identified by matching the mfc_net member introduced previously in
struct mfc6_cache.
Keeping this list global to all netns, also allows us to keep a single
timer (ipmr_expire_timer) to handle their expiration.
In some places cache_resolve_queue_len value was tested for arming
or deleting the timer. These tests were equivalent to testing
mfc_unres_queue value instead and are replaced in this patch.
At the moment, cache_resolve_queue_len is only referenced in init_net.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Preliminary work to make IPv6 multicast forwarding netns-aware.
Dynamically allocates IPv6 multicast forwarding cache, mfc6_cache_array,
and moves it to struct netns_ipv6.
At the moment, mfc6_cache_array is only referenced in init_net.
Replace 'ARRAY_SIZE(mfc6_cache_array)' with mfc6_cache_array size: MFC6_LINES.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch stores into struct mfc6_cache the network namespace each
mfc6_cache belongs to. The new member is mfc6_net.
mfc6_net is assigned at cache allocation and doesn't change during
the rest of the cache entry life.
This will help to retrieve the current netns around the IPv6 multicast
forwarding code.
At the moment, all mfc6_cache are allocated in init_net.
Changelog:
==========
* Use write_pnet()/read_pnet() to set and get mfc6_net.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Preliminary work to make IPv6 multicast forwarding netns-aware.
Dynamically allocates interface table vif6_table and moves it to
struct netns_ipv6, and updates MIF_EXISTS() macro.
At the moment, vif6_table is only referenced in init_net.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Preliminary work to make IPv6 multicast forwarding netns-aware.
Make IPv6 multicast forwarding mroute6_socket per-namespace,
moves it into struct netns_ipv6.
At the moment, mroute6_socket is only referenced in init_net.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes minor annoyance during transmission of unsolicited
neighbor advertisements from userspace to multicast addresses (as
far as I can see in RFC, this is allowed and the similar functionality
for IPv4 has been in arping for a long time).
Outgoing multicast packets get reinserted into local processing as if they
are received from the network. The machine thus sees its own NA and fills
the logs with error messages. This patch removes the message if NA has been
generated locally.
Signed-off-by: Jan Sembera <jsembera@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Today, iproute2 fails to show multicast forwarding unresolved cache
entries while scanning /proc/net/ip_mr_cache.
Indeed, it expects to see -1 in 'Iif' column to identify unresolved
entries but the kernel outputs 65535. It's a signed/unsigned issue:
'Iif', the source interface, is retrieved from member mfc_parent in
struct mfc_cache. mfc_parent is a vifi_t: unsigned short, but is
displayed in ipmr_mfc_seq_show() as "%-3d", signed integer.
In unresolevd entries, the 65535 value (0xFFFF) comes from this define:
#define ALL_VIFS ((vifi_t)(-1))
That may explains why the guy who added support for this in iproute2
thought a -1 should be expected.
I don't know if this must be fixed in kernel or in iproute2. Who is
right? What is the correct API? How was it designed originally?
I let you decide if it should goes in the kernel or be fixed in iproute2.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
/proc/net/ip_mr_cache and /proc/net/ip6_mr_cache displays garbage when
showing unresolved mfc_cache entries.
[root@qemu tests]# cat /proc/net/ip_mr_cache
Group Origin Iif Pkts Bytes Wrong Oifs
014C00EF 010014AC 1 10 10050 0 2:1 3:1
024C00EF 010014AC 65535 514 2 -559067475
The first line is correct. It is a resolved cache entry, 10 packets used it...
The second line represents an unresolved entry, and the columns Pkts(4th),
Bytes(5th) and Wrong(6th) just show garbage.
In struct mfc_cache, there's an union to store data for resolved and
unresolved cases. And what ipmr_mfc_seq_show() is printing in these
columns for the unresolved entries is some bytes from mfc_cache.mfc_un.res.
Bad.
(eg. In our case -559067475 is in fact 0xdead4ead which is the spinlock
magic from mfc_cache.mfc_un.unres.unresolved.lock.magic).
This patch replaces the garbage data written in these columns for the
unresolved entries by '0' (zeros) which is more correct.
This change doesn't break the ABI.
Also, mfc->mfc_un.res.pkt, mfc->mfc_un.res.bytes, mfc->mfc_un.res.wrong_if
are unsigned long.
It applies on top of net-next-2.6.
The patch for net-2.6 is slightly different because of the NIP6_FMT to
%pI6 conversion that was made in the seq_printf.
Changelog:
==========
V2:
* Instead of breaking the ABI by suppressing the columns that have no
meaning for unresolved entries, fill them with 0 values.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
fs/nfsd/nfs4recover.c
Manually fixed above to use new creds API functions, e.g.
nfs4_save_creds().
Signed-off-by: James Morris <jmorris@namei.org>
Instead of using one atomic_t per protocol, use a percpu_counter
for "sockets_allocated", to reduce cache line contention on
heavy duty network servers.
Note : We revert commit (248969ae31
net: af_unix can make unix_nr_socks visbile in /proc),
since it is not anymore used after sock_prot_inuse_add() addition
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass netns pointer to struct xfrm_policy_afinfo::garbage_collect()
[This needs more thoughts on what to do with dst_ops]
[Currently stub to init_net]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass netns to xfrm_lookup()/__xfrm_lookup(). For that pass netns
to flow_cache_lookup() and resolver callback.
Take it from socket or netdevice. Stub DECnet to init_net.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To avoid unnecessary complications with passing netns around.
* set once, very early after allocating
* once set, never changes
For a while create every xfrm_state in init_net.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
this warning:
net/ipv6/ip6_flowlabel.c: In function ‘ipv6_flowlabel_opt’:
net/ipv6/ip6_flowlabel.c:467: warning: ‘err’ may be used uninitialized in this function
triggers because GCC does not recognize the (correct) error flow
between fl_create() and 'err'.
Annotate it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch let nfmark to be evaluated for routing decision for OUTPUT
packet, in mangle table, when process paquet in NFQUEUE. This patch is
an IPv6 port of Laurent Licour IPv4 one.
Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This is the last step to be able to perform full RCU lookups
in __inet_lookup() : After established/timewait tables, we
add RCU lookups to listening hash table.
The only trick here is that a socket of a given type (TCP ipv4,
TCP ipv6, ...) can now flight between two different tables
(established and listening) during a RCU grace period, so we
must use different 'nulls' end-of-chain values for two tables.
We define a large value :
#define LISTENING_NULLS_BASE (1U << 29)
So that slots in listening table are guaranteed to have different
end-of-chain values than slots in established table. A reader can
still detect it finished its lookup in the right chain.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove redundant argument comments in files of net/*
Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now TCP & DCCP use RCU lookups, we can convert ehash rwlocks to spinlocks.
/proc/net/tcp and other seq_file 'readers' can safely be converted to 'writers'.
This should speedup writers, since spin_lock()/spin_unlock()
only use one atomic operation instead of two for write_lock()/write_unlock()
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like IPV4, convert the tunnel virtual devices to use net_device_ops.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert to new network device ops interface.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Because "name" is static, it can be occasionally be filled with
somewhat garbage if two processes read /proc/net/snmp6.
Also, remove useless casts and "-1" -- snprintf() correctly terminates it's
output.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ip6mr.c, /proc entries /proc/net/ip6_mr_cache and /proc/net/ip6_mr_vif
are opened with seq_open_private(), thus seq_release_private() should be
used to release them.
Should fix a small memory leak.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch prepares RCU migration of listening_hash table for
TCP/DCCP protocols.
listening_hash table being small (32 slots per protocol), we add
a spinlock for each slot, instead of a single rwlock for whole table.
This should reduce hold time of readers, and writers concurrency.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert to net_device_ops function table pointer for ioctl.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The first argument to csum_partial is const void *
casts to char/u8 * are not necessary
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RCU was added to UDP lookups, using a fast infrastructure :
- sockets kmem_cache use SLAB_DESTROY_BY_RCU and dont pay the
price of call_rcu() at freeing time.
- hlist_nulls permits to use few memory barriers.
This patch uses same infrastructure for TCP/DCCP established
and timewait sockets.
Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications
using short lived TCP connections. A followup patch, converting
rwlocks to spinlocks will even speedup this case.
__inet_lookup_established() is pretty fast now we dont have to
dirty a contended cache line (read_lock/read_unlock)
Only established and timewait hashtable are converted to RCU
(bind table and listen table are still using traditional locking)
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a straightforward patch, using hlist_nulls infrastructure.
RCUification already done on UDP two weeks ago.
Using hlist_nulls permits us to avoid some memory barriers, both
at lookup time and delete time.
Patch is large because it adds new macros to include/net/sock.h.
These macros will be used by TCP & DCCP in next patch.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
security/keys/internal.h
security/keys/process_keys.c
security/keys/request_key.c
Fixed conflicts above by using the non 'tsk' versions.
Signed-off-by: James Morris <jmorris@namei.org>
Attach creds to file structs and discard f_uid/f_gid.
file_operations::open() methods (such as hppfs_open()) should use file->f_cred
rather than current_cred(). At the moment file->f_cred will be current_cred()
at this point.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <jmorris@namei.org>
Signed-off-by: James Morris <jmorris@namei.org>
Wrap access to task credentials so that they can be separated more easily from
the task_struct during the introduction of COW creds.
Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().
Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
sense to use RCU directly rather than a convenient wrapper; these will be
addressed by later patches.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: netdev@vger.kernel.org
Signed-off-by: James Morris <jmorris@namei.org>
This patch fixes two bugs:
1. setsockopt() of anything but a Type 2 routing header should return
EINVAL instead of EPERM. Noticed by Shan Wei
(shanwei@cn.fujitsu.com).
2. setsockopt()/sendmsg() of a Type 2 routing header with invalid
length or segments should return EINVAL. These values are statically
fixed in RFC 3775, unlike the variable Type 0 was.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The order of cleanup operations in the error/exit section of ip6_mr_init()
is completely inversed. It should be the other way around.
Also a del_timer() is missing in the error path.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds better IPv6 failover support for bonding devices,
especially when in active-backup mode and there are only IPv6 addresses
configured, as reported by Alex Sidorenko.
- Creates a new file, net/drivers/bonding/bond_ipv6.c, for the
IPv6-specific routines. Both regular bonds and VLANs over bonds
are supported.
- Adds a new tunable, num_unsol_na, to limit the number of unsolicited
IPv6 Neighbor Advertisements that are sent on a failover event.
Default is 1.
- Creates two new IPv6 neighbor discovery functions:
ndisc_build_skb()
ndisc_send_skb()
These were required to support VLANs since we have to be able to
add the VLAN id to the skb since ndisc_send_na() and friends
shouldn't be asked to do this. These two routines are basically
__ndisc_send() split into two pieces, in a slightly different order.
- Updates Documentation/networking/bonding.txt and bumps the rev of bond
support to 3.4.0.
On failover, this new code will generate one packet:
- An unsolicited IPv6 Neighbor Advertisement, which helps the switch
learn that the address has moved to the new slave.
Testing has shown that sending just the NA results in pretty good
behavior when in active-back mode, I saw no lost ping packets for example.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
With some net devices types, an IPv6 address configured while the
interface was down can stay 'tentative' forever, even after the interface
is set up. In some case, pending IPv6 DADs are not executed when the
device becomes ready.
I observed this while doing some tests with kvm. If I assign an IPv6
address to my interface eth0 (kvm driver rtl8139) when it is still down
then the address is flagged tentative (IFA_F_TENTATIVE). Then, I set
eth0 up, and to my surprise, the address stays 'tentative', no DAD is
executed and the address can't be pinged.
I also observed the same behaviour, without kvm, with virtual interfaces
types macvlan and veth.
Some easy steps to reproduce the issue with macvlan:
1. ip link add link eth0 type macvlan
2. ip -6 addr add 2003::ab32/64 dev macvlan0
3. ip addr show dev macvlan0
...
inet6 2003::ab32/64 scope global tentative
...
4. ip link set macvlan0 up
5. ip addr show dev macvlan0
...
inet6 2003::ab32/64 scope global tentative
...
Address is still tentative
I think there's a bug in net/ipv6/addrconf.c, addrconf_notify():
addrconf_dad_run() is not always run when the interface is flagged IF_READY.
Currently it is only run when receiving NETDEV_CHANGE event. Looks like
some (virtual) devices doesn't send this event when becoming up.
For both NETDEV_UP and NETDEV_CHANGE events, when the interface becomes
ready, run_pending should be set to 1. Patch below.
'run_pending = 1' could be moved below the if/else block but it makes
the code less readable.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
While adding MIGRATE support to strongSwan, Andreas Steffen noticed that
the selectors provided in XFRM_MSG_ACQUIRE have their family field
uninitialized (those in MIGRATE do have their family set).
Looking at the code, this is because the af-specific init_tempsel()
(called via afinfo->init_tempsel() in xfrm_init_tempsel()) do not set
the value.
Reported-by: Andreas Steffen <andreas.steffen@strongswan.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Arnaud Ebalard <arno@natisbad.org>
I want to compile out proc_* and sysctl_* handlers totally and
stub them to NULL depending on config options, however usage of &
will prevent this, since taking adress of NULL pointer will break
compilation.
So, drop & in front of every ->proc_handler and every ->strategy
handler, it was never needed in fact.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP packets received in udpv6_recvmsg() are not only IPv6 UDP packets, but
also have IPv4 UDP packets, so when do the counter of UDP_MIB_INERRORS in
udpv6_recvmsg(), we should check whether the packet is a IPv6 UDP packet
or a IPv4 UDP packet.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If UDP echo is sent to xinetd/echo-dgram, the UDP reply will be received
at the sender. But the SNMP counter of UDP_MIB_INDATAGRAMS will be not
increased, UDP6_MIB_INDATAGRAMS will be increased instead.
Endpoint A Endpoint B
UDP Echo request ----------->
(IPv4, Dst port=7)
<---------- UDP Echo Reply
(IPv4, Src port=7)
This bug is come from this patch cb75994ec3.
It do counter UDP[6]_MIB_INDATAGRAMS until udp[v6]_recvmsg. Because
xinetd used IPv6 socket to receive UDP messages, thus, when received
UDP packet, the UDP6_MIB_INDATAGRAMS will be increased in function
udpv6_recvmsg() even if the packet is a IPv4 UDP packet.
This patch fixed the problem.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current UDP multicast delivery is not namespace aware.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC4301 Section 7.1 says:
"7.1. Tunnel Mode SAs that Carry Initial and Non-Initial Fragments
All implementations MUST support tunnel mode SAs that are configured
to pass traffic without regard to port field (or ICMP type/code or
Mobility Header type) values. If the SA will carry traffic for
specified protocols, the selector set for the SA MUST specify the
port fields (or ICMP type/code or Mobility Header type) as ANY. An
SA defined in this fashion will carry all traffic including initial
and non-initial fragments for the indicated Local/Remote addresses
and specified Next Layer protocol(s)."
But for IPv6, fragment is treated as a protocol. This change catches
protocol transported in fragmented packet. In IPv4, there is no
problem.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using NIPQUAD() with NIPQUAD_FMT, %d.%d.%d.%d or %u.%u.%u.%u
can be replaced with %pI4
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
gcc warns when using the # modifier with the %p format specifier,
so we can't use this to omit the colons when needed, introduces
%pi6 instead.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Corey Minyard found a race added in commit 271b72c7fa
(udp: RCU handling for Unicast packets.)
"If the socket is moved from one list to another list in-between the
time the hash is calculated and the next field is accessed, and the
socket has moved to the end of the new list, the traversal will not
complete properly on the list it should have, since the socket will
be on the end of the new list and there's not a way to tell it's on a
new list and restart the list traversal. I think that this can be
solved by pre-fetching the "next" field (with proper barriers) before
checking the hash."
This patch corrects this problem, introducing a new
sk_for_each_rcu_safenext() macro.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Goals are :
1) Optimizing handling of incoming Unicast UDP frames, so that no memory
writes should happen in the fast path.
Note: Multicasts and broadcasts still will need to take a lock,
because doing a full lockless lookup in this case is difficult.
2) No expensive operations in the socket bind/unhash phases :
- No expensive synchronize_rcu() calls.
- No added rcu_head in socket structure, increasing memory needs,
but more important, forcing us to use call_rcu() calls,
that have the bad property of making sockets structure cold.
(rcu grace period between socket freeing and its potential reuse
make this socket being cold in CPU cache).
David did a previous patch using call_rcu() and noticed a 20%
impact on TCP connection rates.
Quoting Cristopher Lameter :
"Right. That results in cacheline cooldown. You'd want to recycle
the object as they are cache hot on a per cpu basis. That is screwed
up by the delayed regular rcu processing. We have seen multiple
regressions due to cacheline cooldown.
The only choice in cacheline hot sensitive areas is to deal with the
complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU."
- Because udp sockets are allocated from dedicated kmem_cache,
use of SLAB_DESTROY_BY_RCU can help here.
Theory of operation :
---------------------
As the lookup is lockfree (using rcu_read_lock()/rcu_read_unlock()),
special attention must be taken by readers and writers.
Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be freed,
reused, inserted in a different chain or in worst case in the same chain
while readers could do lookups in the same time.
In order to avoid loops, a reader must check each socket found in a chain
really belongs to the chain the reader was traversing. If it finds a
mismatch, lookup must start again at the begining. This *restart* loop
is the reason we had to use rdlock for the multicast case, because
we dont want to send same message several times to the same socket.
We use RCU only for fast path.
Thus, /proc/net/udp still takes spinlocks.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP sockets are hashed in a 128 slots hash table.
This hash table is protected by *one* rwlock.
This rwlock is readlocked each time an incoming UDP message is handled.
This rwlock is writelocked each time a socket must be inserted in
hash table (bind time), or deleted from this table (close time)
This is not scalable on SMP machines :
1) Even in read mode, lock() and unlock() are atomic operations and
must dirty a contended cache line, shared by all cpus.
2) A writer might be starved if many readers are 'in flight'. This can
happen on a machine with some NIC receiving many UDP messages. User
process can be delayed a long time at socket creation/dismantle time.
This patch prepares RCU migration, by introducing 'struct udp_table
and struct udp_hslot', and using one spinlock per chain, to reduce
contention on central rwlock.
Introducing one spinlock per chain reduces latencies, for port
randomization on heavily loaded UDP servers. This also speedup
bindings to specific ports.
udp_lib_unhash() was uninlined, becoming to big.
Some cleanups were done to ease review of following patch
(RCUification of UDP Unicast lookups)
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The define in kernel.h can be done away with at a later time.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
netfilter: replace old NF_ARP calls with NFPROTO_ARP
netfilter: fix compilation error with NAT=n
netfilter: xt_recent: use proc_create_data()
netfilter: snmp nat leaks memory in case of failure
netfilter: xt_iprange: fix range inversion match
netfilter: netns: use NFPROTO_NUMPROTO instead of NUMPROTO for tables array
netfilter: ctnetlink: remove obsolete NAT dependency from Kconfig
pkt_sched: sch_generic: Fix oops in sch_teql
dccp: Port redirection support for DCCP
tcp: Fix IPv6 fallout from 'Port redirection support for TCP'
netdev: change name dropping error codes
ipvs: Update CONFIG_IP_VS_IPV6 description and help text
'tcp: Port redirection support for TCP' (a3116ac5c) added a new member
to inet_request_sock() which inet_csk_clone() makes use of but failed
to add proper initialization to the IPv6 syncookie code and missed a
couple of places where the new member should be used instead of
inet_sk(sk)->sport.
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
name and nlen parameters passed to ->strategy hook are unused, remove
them. In general ->strategy hook should know what it's doing, and don't
do something tricky for which, say, pointer to original userspace array
may be needed (name).
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net> [ networking bits ]
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Problem observed:
In IPv6, in the presence of multiple routers candidates to
default gateway in one segment, each sending a different
value of preference, the Linux hosts connected to the
segment weren't selecting the right one in all the
combinations possible of LOW/MEDIUM/HIGH preference.
This patch changes two files:
include/linux/icmpv6.h
Get the "router_pref" bitfield in the right place
(as RFC4191 says), named the bit left with this fix as
"home_agent" (RFC3775 say that's his function)
net/ipv6/ndisc.c
Corrects the binary logic behind the updating of the
router preference in the flags of the routing table
Result:
With this two fixes applied, the default route used by
the system was to consistent with the rules mentioned
in RFC4191 in case of changes in the value of preference
in router advertisements
Signed-off-by: Pedro Ribeiro <pribeiro@net.ipl.pt>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
LD net/ipv6/ipv6.o
WARNING: net/ipv6/ipv6.o(.text+0xd8): Section mismatch in reference from the function inet6_net_init() to the function .init.text:ipv6_init_mibs()
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fix error with CONFIG_TCP_MD5SIG disabled.
Signed-off-by: Guo-Fu Tseng <cooldavid@cooldavid.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
$ codiff tcp_ipv6.o.old tcp_ipv6.o.new
net/ipv6/tcp_ipv6.c:
tcp_v6_md5_hash_hdr | -144
tcp_v6_send_ack | -585
tcp_v6_send_reset | -540
3 functions changed, 1269 bytes removed, diff: -1269
net/ipv6/tcp_ipv6.c:
tcp_v6_send_response | +791
1 function changed, 791 bytes added, diff: +791
tcp_ipv6.o.new:
4 functions changed, 791 bytes added, 1269 bytes removed, diff: -478
I choose to leave the reset related netns comment in place (not
the one that is killed) as I cannot understand its English so
it's a bit hard for me to evaluate its usefulness :-).
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Maybe it's just me but I guess those md5 people made a mess
out of it by having *_md5_hash_* to use daddr, saddr order
instead of the one that is natural (and equal to what csum
functions use). For the segment were sending, the original
addresses are reversed so buff's saddr == skb's daddr and
vice-versa.
Maybe I can finally proceed with unification of some code
after fixing it first... :-)
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
More breakage :-), part of timestamps just were previously
overwritten.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
While looking for some common code I came across difference
in checksum calculation between tcp_v6_send_(reset|ack) I
couldn't explain. I checked both v4 and v6 and found out that
both seem to have the same "feature". I couldn't find anything
in rfc nor anywhere else which would state that md5 option
should be ignored like it was in case of reset so I came to
a conclusion that this is probably a genuine bug. I suspect
that addition of md5 just was fooled by the excessive
copy-paste code in those functions and the reset part was
never tested well enough to find out the problem.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
By passing in the family through which extensions were invoked, a bit
of data space can be reclaimed. The "family" member will be added to
the parameter structures and the check functions be adjusted.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch does this for target extensions' destroy functions.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch does this for target extensions' checkentry functions.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch does this for target extensions' target functions.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch does this for match extensions' destroy functions.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch does this for match extensions' checkentry functions.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The function signatures for Xtables extensions have grown over time.
It involves a lot of typing/replication, and also a bit of stack space
even if they are not used. Realize an NFWS2008 idea and pack them into
structs. The skb remains outside of the struct so gcc can continue to
apply its optimizations.
This patch does this for match extensions' match functions.
A few ambiguities have also been addressed. The "offset" parameter for
example has been renamed to "fragoff" (there are so many different
offsets already) and "protoff" to "thoff" (there is more than just one
protocol here, so clarify).
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
ip6t_LOG does certainly not depend on the filter table.
(Also, move it so that menuconfig still displays it correctly.)
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
It used to be that {ip,ip6,etc}_tables called extension->checkentry
themselves, but this can be moved into the xtables core.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This is cleaner, we already know conntrack to which event is relevant.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Again, it's deducible from skb, but we're going to use it for
nf_conntrack_checksum and statistics, so just pass it from upper layer.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
It's deducible from skb->dev or skb->dst->dev, but we know netns at
the moment of call, so pass it down and use for finding and creating
conntracks.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Take netns from skb->dst->dev. It should be safe because, they are called
from LOCAL_OUT hook where dst is valid (though, I'm not exactly sure about
IPVS and queueing packets to userspace).
[Patrick: its safe everywhere since they already expect skb->dst to be set]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
* make per-netns conntrack hash
Other solution is to add ->ct_net pointer to tuplehashes and still has one
hash, I tried that it's ugly and requires more code deep down in protocol
modules et al.
* propagate netns pointer to where needed, e. g. to conntrack iterators.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Now that dev_net() exists, the usefullness of them is even less. Also they're
a big problem in resolving circular header dependencies necessary for
NOTRACK-in-netns patch. See below.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
and (try to) consistently use u_int8_t for the L3 family.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The content of init_ipv6_mibs/cleanup_ipv6_mibs will be moved to new
calls one by one next.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
idev has been stored on seq->private. NULL has been stored for global
statistics.
The situation is changed with net namespace. We need to store pointer to
struct net and the only place is seq->private. So, we'll have for
/proc/net/dev_snmp6/* and for /proc/net/snmp6 pointers of two different
types stored in the same field.
This effectively requires to separate seq_ops of these files.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Simple, comsolidate sockstat6 staff in one place, at the beginning of
the file. Right now sockstat6_seq_open/sockstat6_seq_fops looks like an
intrusion in the middle of snmp6 code.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes that ip6_route_net_init() does all of the route init code.
There used to be a race between ip6_route_net_init() and ip6_net_init()
and someone relying on the combined result was left out cold.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6_route_net_init() error handling looked less than solid, fix 'er up.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the socket cached in the skb if it's present.
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To be able to use the cached socket reference in the skb during input
processing we add a new set of lookup functions that receive the skb on
their argument list.
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To be able to use the cached socket reference in the skb during input
processing we add a new set of lookup functions that receive the skb on
their argument list.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet_iif() in inet_sock.h requires route.h. Since users of inet_iif()
usually require other route.h functionality anyway this patch moves
inet_iif() to route.h.
Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
The following actions are possible:
tcp_v6_rcv
skb->dev = NULL;
tcp_v6_do_rcv
tcp_v6_hnd_req
tcp_check_req
req->rsk_ops->send_ack == tcp_v6_send_ack
So, skb->dev can be NULL in tcp_v6_send_ack. We must obtain namespace
from dst entry.
Thanks to Vitaliy Gusev <vgusev@openvz.org> for initial problem finding
in IPv4 code.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current code ignores rules for internal options in HBH/DST options
header in packet processing if 'Not strict' mode is specified (which is not
implemented). Clearly it is not expected by user.
Kernel should reject HBH/DST rule insertion with 'Not strict' mode
in the first place.
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
inet6_rsk() is called on a struct request_sock * before we
have checked whether the socket is an ipv6 socket or a ipv6-
mapped ipv4 socket. The access that triggers this is the
inet_rsk(rsk)->inet6_rsk_offset dereference in inet6_rsk().
This is arguably not a critical error as the inet6_rsk_offset
is only used to compute a pointer which is never really used
(in the code path in question) anyway. But it might be a
latent error, so let's fix it.
Spotted by kmemcheck.
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg reported that occaisionally, bringing an interface
down or unregistering it would hang for up to 30 seconds. Using
debugging output he provided it became clear that ICMP6 routes
were the culprit.
The problem is that ICMP6 routes live in their own world totally
separate from normal ipv6 routes. So there are all kinds of special
cases throughout the ipv6 code to handle this.
While we should really try to unify all of this stuff somehow,
for the time being let's fix this by purging the ICMP6 routes
that match the device in question during rt6_ifdown().
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes kernel bugzilla 11469: "TUN with 1024 neighbours:
ip6_dst_lookup_tail NULL crash"
dst->neighbour is not necessarily hooked up at this point
in the processing path, so blindly dereferencing it is
the wrong thing to do. This NULL check exists in other
similar paths and this case was just an oversight.
Also fix the completely wrong and confusing indentation
here while we're at it.
Based upon a patch by Evgeniy Polyakov.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
How to reproduce ?
- create a network namespace
- use tcp protocol and get timewait socket
- exit the network namespace
- after a moment (when the timewait socket is destroyed), the kernel
panics.
# BUG: unable to handle kernel NULL pointer dereference at
0000000000000007
IP: [<ffffffff821e394d>] inet_twdr_do_twkill_work+0x6e/0xb8
PGD 119985067 PUD 11c5c0067 PMD 0
Oops: 0000 [1] SMP
CPU 1
Modules linked in: ipv6 button battery ac loop dm_mod tg3 libphy ext3 jbd
edd fan thermal processor thermal_sys sg sata_svw libata dock serverworks
sd_mod scsi_mod ide_disk ide_core [last unloaded: freq_table]
Pid: 0, comm: swapper Not tainted 2.6.27-rc2 #3
RIP: 0010:[<ffffffff821e394d>] [<ffffffff821e394d>]
inet_twdr_do_twkill_work+0x6e/0xb8
RSP: 0018:ffff88011ff7fed0 EFLAGS: 00010246
RAX: ffffffffffffffff RBX: ffffffff82339420 RCX: ffff88011ff7ff30
RDX: 0000000000000001 RSI: ffff88011a4d03c0 RDI: ffff88011ac2fc00
RBP: ffffffff823392e0 R08: 0000000000000000 R09: ffff88002802a200
R10: ffff8800a5c4b000 R11: ffffffff823e4080 R12: ffff88011ac2fc00
R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000
FS: 0000000041cbd940(0000) GS:ffff8800bff839c0(0000)
knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000007 CR3: 00000000bd87c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffff8800bff9e000, task
ffff88011ff76690)
Stack: ffffffff823392e0 0000000000000100 ffffffff821e3a3a
0000000000000008
0000000000000000 ffffffff821e3a61 ffff8800bff7c000 ffffffff8203c7e7
ffff88011ff7ff10 ffff88011ff7ff10 0000000000000021 ffffffff82351108
Call Trace:
<IRQ> [<ffffffff821e3a3a>] ? inet_twdr_hangman+0x0/0x9e
[<ffffffff821e3a61>] ? inet_twdr_hangman+0x27/0x9e
[<ffffffff8203c7e7>] ? run_timer_softirq+0x12c/0x193
[<ffffffff820390d1>] ? __do_softirq+0x5e/0xcd
[<ffffffff8200d08c>] ? call_softirq+0x1c/0x28
[<ffffffff8200e611>] ? do_softirq+0x2c/0x68
[<ffffffff8201a055>] ? smp_apic_timer_interrupt+0x8e/0xa9
[<ffffffff8200cad6>] ? apic_timer_interrupt+0x66/0x70
<EOI> [<ffffffff82011f4c>] ? default_idle+0x27/0x3b
[<ffffffff8200abbd>] ? cpu_idle+0x5f/0x7d
Code: e8 01 00 00 4c 89 e7 41 ff c5 e8 8d fd ff ff 49 8b 44 24 38 4c 89 e7
65 8b 14 25 24 00 00 00 89 d2 48 8b 80 e8 00 00 00 48 f7 d0 <48> 8b 04 d0
48 ff 40 58 e8 fc fc ff ff 48 89 df e8 c0 5f 04 00
RIP [<ffffffff821e394d>] inet_twdr_do_twkill_work+0x6e/0xb8
RSP <ffff88011ff7fed0>
CR2: 0000000000000007
This patch provides a function to purge all timewait sockets related
to a network namespace. The timewait sockets life cycle is not tied with
the network namespace, that means the timewait sockets stay alive while
the network namespace dies. The timewait sockets are for avoiding to
receive a duplicate packet from the network, if the network namespace is
freed, the network stack is removed, so no chance to receive any packets
from the outside world. Furthermore, having a pending destruction timer
on these sockets with a network namespace freed is not safe and will lead
to an oops if the timer callback which try to access data belonging to
the namespace like for example in:
inet_twdr_do_twkill_work
-> NET_INC_STATS_BH(twsk_net(tw), LINUX_MIB_TIMEWAITED);
Purging the timewait sockets at the network namespace destruction will:
1) speed up memory freeing for the namespace
2) fix kernel panic on asynchronous timewait destruction
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-by: Denis V. Lunev <den@openvz.org>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Braino: net.ipv6 in ipv6 skeleton has no business in rotable
class
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes a problem spotted with zebra, but not sure if it is
necessary a kernel problem. With IPV6 when an address is added to an
interface, Zebra creates a duplicate RIB entry, one as a connected
route, and other as a kernel route.
When an address is added to an interface the RTN_NEWADDR message
causes Zebra to create a connected route. In IPV4 when an address is
added to an interface a RTN_NEWROUTE message is set to user space with
the protocol RTPROT_KERNEL. Zebra ignores these messages, because it
already has the connected route.
The problem is that route created in IPV6 has route protocol ==
RTPROT_BOOT. Was this a design decision or a bug? This fixes it. Same
patch applies to both net-2.6 and stable.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When get receiving interface index while no message is received,
the bounded device's index of the socket should be returned.
RFC 3542:
Issuing getsockopt() for the above options will return the sticky
option value i.e., the value set with setsockopt(). If no sticky
option value has been set getsockopt() will return the following
values:
- For the IPV6_PKTINFO option, it will return an in6_pktinfo
structure with ipi6_addr being in6addr_any and ipi6_ifindex being
zero.
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6_dev_get_saddr() blindly de-references dst_dev to get the network
namespace, but some callers might pass NULL. Change callers to pass a
namespace pointer instead.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexey Dobriyan wrote:
> On Thu, Aug 07, 2008 at 07:00:56PM +0200, John Gumb wrote:
>> Scenario: no ipv6 default route set.
>
>> # ip -f inet6 route get fec0::1
>>
>> BUG: unable to handle kernel NULL pointer dereference at 00000000
>> IP: [<c0369b85>] rt6_fill_node+0x175/0x3b0
>> EIP is at rt6_fill_node+0x175/0x3b0
>
> 0xffffffff80424dd3 is in rt6_fill_node (net/ipv6/route.c:2191).
> 2186 } else
> 2187 #endif
> 2188 NLA_PUT_U32(skb, RTA_IIF, iif);
> 2189 } else if (dst) {
> 2190 struct in6_addr saddr_buf;
> 2191 ====> if (ipv6_dev_get_saddr(ip6_dst_idev(&rt->u.dst)->dev,
> ^^^^^^^^^^^^^^^^^^^^^^^^
> NULL
>
> 2192 dst, 0, &saddr_buf) == 0)
> 2193 NLA_PUT(skb, RTA_PREFSRC, 16, &saddr_buf);
> 2194 }
The commit that changed this can't be reverted easily, but the patch
below works for me.
Fix NULL de-reference in rt6_fill_node() when there's no IPv6 input
device present in the dst entry.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The socket lock is there to protect the normal UDP receive path.
Encapsulation UDP sockets don't need that protection. In fact
the locking is deadly for them as they may contain another UDP
packet within, possibly with the same addresses.
Also the nested bit was copied from TCP. TCP needs it because
of accept(2) spawning sockets. This simply doesn't apply to UDP
so I've removed it.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here's a revised version, based on Herbert's comments, of a fix for
the ipv4-inner, ipv6-outer interfamily ipsec beet mode. It fixes the
network header adjustment during interfamily, as well as makes sure
that we reserve enough room for the new ipv6 header if we might have
something else as the inner family. Also, the ipv4 pseudo header
construction was added.
Signed-off-by: Joakim Koskela <jookos@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces dst_metric() with dst_mtu() in net/ipv6/route.c.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The old code will drop IPv6 packet if ipfragok is not set, since
ipfragok is obsoleted, will be instead by used skb->local_df, so this
check must be changed to skb->local_df.
This patch fix this problem and not drop packet if skb->local_df is
set to true.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When Set Hop-by-Hop options header with NULL data
pointer and optlen is not zero use setsockopt(),
the kernel successfully return 0 instead of
return error EINVAL or EFAULT.
This patch fix the problem.
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
cookie_v6_check() did not call reqsk_free() if xfrm_lookup() fails,
leaking the request sock.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reported by Stefanos Harhalakis; although 2.6.27-rc1 talks to itself using IPv6
TCP MD5 packets just fine, Stefanos noted that tcpdump claimed that the
signatures were invalid.
I broke this in 49a72dfb88 ("tcp: Fix MD5
signatures for non-linear skbs"), it was just a typo.
Note that tcpdump will still sometimes claim that the signatures are incorrect.
A patch to tcpdump has been submitted for this[1].
[1] http://tinyurl.com/6a4fl2
Signed-off-by: Adam Langley <agl@imperialviolet.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
I noticed, looking at tcpdumps, that timewait ACKs were getting sent
with an incorrect MD5 signature when signatures were enabled.
I broke this in 49a72dfb88 ("tcp: Fix
MD5 signatures for non-linear skbs"). I didn't take into account that
the skb passed to tcp_*_send_ack was the inbound packet, thus the
source and dest addresses need to be swapped when calculating the MD5
pseudoheader.
Signed-off-by: Adam Langley <agl@imperialviolet.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
SCTP used ip6_xmit() to send fragments after received ICMP packet too
big message. But while send packet used ip6_xmit, the skb->local_df is
not initialized. So when skb if enter ip6_fragment(), the following
code will discard the skb.
ip6_fragment(...)
{
if (!skb->local_df) {
...
return -EMSGSIZE;
}
...
}
SCTP do the following step:
1. send packet ip6_xmit(skb, ipfragok=0)
2. received ICMP packet too big message
3. if PMTUD_ENABLE: ip6_xmit(skb, ipfragok=1)
This patch fixed the problem by set local_df if ipfragok is true.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This call is no longer needed, sockstat6 is per namespace so it is
removed at the namespace subsystem destruction.
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an error occured, datagram_send_ctl() should exit immediately rather than
continue to run the for loop. Otherwise, the variable err might be changed and
the error might be hidden.
Fix this bug by using "goto" instead of "break".
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Piss-poor sysctl registration API strikes again, film at 11...
What we really need is _pathname_ required to be present in
already registered table, so that kernel could warn about bad
order. That's the next target for sysctl stuff (and generally
saner and more explicit order of initialization of ipv[46]
internals wouldn't hurt either).
For the time being, here are full fixups required by ..._rotable()
stuff; we make per-net sysctl sets descendents of "ro" one and
make sure that sufficient skeleton is there before we start registering
per-net sysctls.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/ipv4/ipcomp.c: In function ‘ipcomp4_init_state’:
net/ipv4/ipcomp.c:109: warning: unused variable ‘calg_desc’
net/ipv4/ipcomp.c:108: warning: unused variable ‘ipcd’
net/ipv4/ipcomp.c:107: warning: ‘err’ may be used uninitialized in this function
net/ipv6/ipcomp6.c: In function ‘ipcomp6_init_state’:
net/ipv6/ipcomp6.c:139: warning: unused variable ‘calg_desc’
net/ipv6/ipcomp6.c:138: warning: unused variable ‘ipcd’
net/ipv6/ipcomp6.c:137: warning: ‘err’ may be used uninitialized in this function
Signed-off-by: David S. Miller <davem@davemloft.net>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
netns: fix ip_rt_frag_needed rt_is_expired
netfilter: nf_conntrack_extend: avoid unnecessary "ct->ext" dereferences
netfilter: fix double-free and use-after free
netfilter: arptables in netns for real
netfilter: ip{,6}tables_security: fix future section mismatch
selinux: use nf_register_hooks()
netfilter: ebtables: use nf_register_hooks()
Revert "pkt_sched: sch_sfq: dump a real number of flows"
qeth: use dev->ml_priv instead of dev->priv
syncookies: Make sure ECN is disabled
net: drop unused BUG_TRAP()
net: convert BUG_TRAP to generic WARN_ON
drivers/net: convert BUG_TRAP to generic WARN_ON
Currently not visible, because NET_NS is mutually exclusive with SYSFS
which is required by SECURITY.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
ecn_ok is not initialized when a connection is established by cookies.
The cookie syn-ack never sets ECN, so ecn_ok must be set to 0.
Spotted using ns-3/network simulation cradle simulator and valgrind.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Removes legacy reinvent-the-wheel type thing. The generic
machinery integrates much better to automated debugging aids
such as kerneloops.org (and others), and is unambiguous due to
better naming. Non-intuively BUG_TRAP() is actually equal to
WARN_ON() rather than BUG_ON() though some might actually be
promoted to BUG_ON() but I left that to future.
I could make at least one BUILD_BUG_ON conversion.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>