Having frag_list members which holds wmem of an sk leads to nightmares
with partially cloned frag skb's. The reason is that once you unleash
a skb with a frag_list that has individual sk ownerships into the stack
you can never undo those ownerships safely as they may have been cloned
by things like netfilter. Since we have to undo them in order to make
skb_linearize happy this approach leads to a dead-end.
So let's go the other way and make this an invariant:
For any skb on a frag_list, skb->sk must be NULL.
That is, the socket ownership always belongs to the head skb.
It turns out that the implementation is actually pretty simple.
The above invariant is actually violated in the following patch
for a short duration inside ip_fragment. This is OK because the
offending frag_list member is either destroyed at the end of the
slow path without being sent anywhere, or it is detached from
the frag_list before being sent.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
I found a bug that stopped IPsec/IPv6 from working. About
a month ago IPv6 started using rt6i_idev->dev on the cached socket dst
entries. If the cached socket dst entry is IPsec, then rt6i_idev will
be NULL.
Since we want to look at the rt6i_idev of the original route in this
case, the easiest fix is to store rt6i_idev in the IPsec dst entry just
as we do for a number of other IPv6 route attributes. Unfortunately
this means that we need some new code to handle the references to
rt6i_idev. That's why this patch is bigger than it would otherwise be.
I've also done the same thing for IPv4 since it is conceivable that
once these idev attributes start getting used for accounting, we
probably need to dereference them for IPv4 IPsec entries too.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Let's recap the problem. The current asynchronous netlink kernel
message processing is vulnerable to these attacks:
1) Hit and run: Attacker sends one or more messages and then exits
before they're processed. This may confuse/disable the next netlink
user that gets the netlink address of the attacker since it may
receive the responses to the attacker's messages.
Proposed solutions:
a) Synchronous processing.
b) Stream mode socket.
c) Restrict/prohibit binding.
2) Starvation: Because various netlink rcv functions were written
to not return until all messages have been processed on a socket,
it is possible for these functions to execute for an arbitrarily
long period of time. If this is successfully exploited it could
also be used to hold rtnl forever.
Proposed solutions:
a) Synchronous processing.
b) Stream mode socket.
Firstly let's cross off solution c). It only solves the first
problem and it has user-visible impacts. In particular, it'll
break user space applications that expect to bind or communicate
with specific netlink addresses (pid's).
So we're left with a choice of synchronous processing versus
SOCK_STREAM for netlink.
For the moment I'm sticking with the synchronous approach as
suggested by Alexey since it's simpler and I'd rather spend
my time working on other things.
However, it does have a number of deficiencies compared to the
stream mode solution:
1) User-space to user-space netlink communication is still vulnerable.
2) Inefficient use of resources. This is especially true for rtnetlink
since the lock is shared with other users such as networking drivers.
The latter could hold the rtnl while communicating with hardware which
causes the rtnetlink user to wait when it could be doing other things.
3) It is still possible to DoS all netlink users by flooding the kernel
netlink receive queue. The attacker simply fills the receive socket
with a single netlink message that fills up the entire queue. The
attacker then continues to call sendmsg with the same message in a loop.
Point 3) can be countered by retransmissions in user-space code, however
it is pretty messy.
In light of these problems (in particular, point 3), we should implement
stream mode netlink at some point. In the mean time, here is a patch
that implements synchronous processing.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Converts remaining rtnetlink_link tables to use c99 designated
initializers to make greping a little bit easier.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
I made a mistake in my last patch to the raw socket checksum code.
I used the value of inet->cork.length as the length of the payload.
While this works with normal packets, it breaks down when IPsec is
present since the cork length includes the extension header length.
So here is a patch to fix the length calculations.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
On Mon, Apr 25, 2005 at 12:01:13PM -0400, Dave Jones wrote:
> This has been brought up before.. http://lkml.org/lkml/2000/1/21/116
> but didnt seem to get resolved. This morning I got someone
> file a bugzilla about it breaking sysctl(8).
And here's its ipv6 counterpart.
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A lot of places in there are including major.h for no reason whatsoever.
Removed. And yes, it still builds.
The history of that stuff is often amusing. E.g. for net/core/sock.c
the story looks so, as far as I've been able to reconstruct it: we used
to need major.h in net/socket.c circa 1.1.early. In 1.1.13 that need
had disappeared, along with register_chrdev(SOCKET_MAJOR, "socket",
&net_fops) in sock_init(). Include had not. When 1.2 -> 1.3 reorg of
net/* had moved a lot of stuff from net/socket.c to net/core/sock.c,
this crap had followed...
Signed-off-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Please apply, SCTP/DCCP needs this when INET_REFCNT_DEBUG
is set.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SELinux hooks invoke ipv6_skip_exthdr() with an incorrect
length final argument. However, the length argument turns out
to be superfluous.
I was just reading ipv6_skip_exthdr and it occured to me that we can
get rid of len altogether. The only place where len is used is to
check whether the skb has two bytes for ipv6_opt_hdr. This check
is done by skb_header_pointer/skb_copy_bits anyway.
Now it might appear that we've made the code slower by deferring
the check to skb_copy_bits. However, this check should not trigger
in the common case so this is OK.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
While looking at this problem I noticed that IPv6 was sometimes
looking at inet->recverr which is bogus. Here is a patch to
correct that and use np->recverr.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
So here is a patch that introduces skb_store_bits -- the opposite of
skb_copy_bits, and uses them to read/write the csum field in rawv6.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!