When we've merged skb's with page frags, and subsequently receive
a trailer skb (< MSS) that is not completely non-linear (this can
occur on Intel NICs if the packet size falls below the threshold),
GRO ends up producing an illegal GSO skb with a frag_list.
This is harmless unless the skb is then forwarded through an
interface that requires software GSO, whereupon the GSO code
will BUG.
This patch detects this case in GRO and avoids merging the
trailer skb.
Reported-by: Mark Wagner <mwagner@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
NAPI drivers try to recycle SKBs in their polling routine, but we
generally don't know the context in which the polling will be called,
and the skb recycling itself may require IRQs to be enabled.
This patch adds irqs_disabled() test to the skb_recycle_check()
routine, so that we'll not let the drivers hit the skb recycling
path with IRQs disabled.
As a side effect, this patch actually disables skb recycling for some
[broken] drivers. E.g. gianfar driver grabs an irqsave spinlock during
TX ring processing, and then tries to recycle an skb, and that caused
the following badness:
nf_conntrack version 0.5.0 (1008 buckets, 4032 max)
------------[ cut here ]------------
Badness at kernel/softirq.c:143
NIP: c003e3c4 LR: c423a528 CTR: c003e344
...
NIP [c003e3c4] local_bh_enable+0x80/0xc4
LR [c423a528] destroy_conntrack+0xd4/0x13c [nf_conntrack]
Call Trace:
[c15d1b60] [c003e32c] local_bh_disable+0x1c/0x34 (unreliable)
[c15d1b70] [c423a528] destroy_conntrack+0xd4/0x13c [nf_conntrack]
[c15d1b80] [c02c6370] nf_conntrack_destroy+0x3c/0x70
Signed-off-by: David S. Miller <davem@davemloft.net>
mac80211 required this due to the master netdev, but now
it can put all information into skb->cb and this can go.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (55 commits)
netxen: fix tx ring accounting
netxen: fix detection of cut-thru firmware mode
forcedeth: fix dma api mismatches
atm: sk_wmem_alloc initial value is one
net: correct off-by-one write allocations reports
via-velocity : fix no link detection on boot
Net / e100: Fix suspend of devices that cannot be power managed
TI DaVinci EMAC : Fix rmmod error
net: group address list and its count
ipv4: Fix fib_trie rebalancing, part 2
pkt_sched: Update drops stats in act_police
sky2: version 1.23
sky2: add GRO support
sky2: skb recycling
sky2: reduce default transmit ring
sky2: receive counter update
sky2: fix shutdown synchronization
sky2: PCI irq issues
sky2: more receive shutdown
sky2: turn off pause during shutdown
...
Manually fix trivial conflict in net/core/skbuff.c due to kmemcheck
The skb mac_header field is sometimes NULL (or ~0u) as a sentinel
value. The places where skb is expanded add an offset which would
change this flag into an invalid pointer (or offset).
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Looking at the crash in log_martians(), one suspect is that the check for
mac header being set is not correct. The value of mac_header defaults to
0 on allocation, therefore skb_mac_header_was_set will always be true on
platforms using NET_SKBUFF_USES_OFFSET.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (244 commits)
Revert "x86, bts: reenable ptrace branch trace support"
tracing: do not translate event helper macros in print format
ftrace/documentation: fix typo in function grapher name
tracing/events: convert block trace points to TRACE_EVENT(), fix !CONFIG_BLOCK
tracing: add protection around module events unload
tracing: add trace_seq_vprint interface
tracing: fix the block trace points print size
tracing/events: convert block trace points to TRACE_EVENT()
ring-buffer: fix ret in rb_add_time_stamp
ring-buffer: pass in lockdep class key for reader_lock
tracing: add annotation to what type of stack trace is recorded
tracing: fix multiple use of __print_flags and __print_symbolic
tracing/events: fix output format of user stack
tracing/events: fix output format of kernel stack
tracing/trace_stack: fix the number of entries in the header
ring-buffer: discard timestamps that are at the start of the buffer
ring-buffer: try to discard unneeded timestamps
ring-buffer: fix bug in ring_buffer_discard_commit
ftrace: do not profile functions when disabled
tracing: make trace pipe recognize latency format flag
...
In order to handle powersave frames properly we had needed
to pass these out to the device queues again, and introduce
the skb->requeue bit. This, however, also has unnecessary
overhead by needing to 'clean up' already tried frames, and
this clean-up code is also buggy when software encryption
is used.
Instead of sending the frames via the master netdev queue
again, simply put them into the pending queue. This also
fixes a problem where frames for that particular station
could be reordered when some were still on the software
queues and older ones are re-injected into the software
queue after them.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
On Thu, Jun 04, 2009 at 09:06:00PM +1000, Herbert Xu wrote:
>
> tun: Optimise handling of bogus gso->hdr_len
>
> As all current versions of virtio_net generate a value for the
> header length that's too small, we should optimise this so that
> we don't copy it twice. This can be done by ensuring that it is
> at least as large as the place where we'll write the checksum.
>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
With this applied we can strengthen the partial checksum check:
In skb_partial_csum_set we check to see if the checksum offset
is within the packet. However, we really should check that it
is within the skb head as that's the only bit we can modify
without copying.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define three accessors to get/set dst attached to a skb
struct dst_entry *skb_dst(const struct sk_buff *skb)
void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
void skb_dst_drop(struct sk_buff *skb)
This one should replace occurrences of :
dst_release(skb->dst)
skb->dst = NULL;
Delete skb->dst field
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch stores the two shinfo pointers in local variables
because they're used over and over again in skb_gro_receive.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch reverses the direction of the frags array copy in
skb_gro_receive in order simplify the loop conditional. It
also avoids touching the first element of the original frags
array.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch stores the offset/headlen in local variables as they're
used repeatedly in skb_gro_offset.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
gcc does a poor job at generating code for the memcpy of the frags
array in skb_gro_receive, which is the primary purpose of that
function when merging frags. In particular, it can't utilise the
alignment information of the source and destination. This patch
open-codes the copy so we process words instead of bytes.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hi:
skbuff: Copy csum instead of csum_start/csum_offset
It's easier to copy the u32 csum instead of its two u16
constituents.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Cheers,
Signed-off-by: David S. Miller <davem@davemloft.net>
Hi:
skbuff: Move new __skb_clone code into __copy_skb_header
It seems that people just keep on adding stuff to __skb_clone
instead __copy_skb_header. This is wrong as it means your brand-new
attributes won't always get copied as you intended.
This patch moves them to the right place, and adds a comment to
prevent this from happening again.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Thanks,
Signed-off-by: David S. Miller <davem@davemloft.net>
When called with a consumed value that is less than skb_headlen(skb)
bytes into a page frag, skb_seq_read() incorrectly returns an
offset/length relative to skb->data. Ensure that data which should come
from a page frag does.
Signed-off-by: Thomas Chenault <thomas_chenault@dell.com>
Tested-by: Shyam Iyer <shyam_iyer@dell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge reason: tracing/core was on a .30-rc1 base and was missing out on
on a handful of tracing fixes present in .30-rc5-almost.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit ac45f602ee ("net: infrastructure
for hardware time stamping") added two skb initialization actions to
__alloc_skb(), which need to be added to skb_recycle_check() as well.
Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org>
Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lennert Buytenhek wrote:
> Since 4fb6699481 ("net: Optimize memory
> usage when splicing from sockets.") I'm seeing this oops (e.g. in
> 2.6.30-rc3) when splicing from a TCP socket to /dev/null on a driver
> (mv643xx_eth) that uses LRO in the skb mode (lro_receive_skb) rather
> than the frag mode:
My patch incorrectly assumed skb->sk was always valid, but for
"frag_listed" skbs we can only use skb->sk of their parent.
Reported-by: Lennert Buytenhek <buytenh@wantstofly.org>
Debugged-by: Lennert Buytenhek <buytenh@wantstofly.org>
Tested-by: Lennert Buytenhek <buytenh@wantstofly.org>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Impact: clean up
Create a sub directory in include/trace called events to keep the
trace point headers in their own separate directory. Only headers that
declare trace points should be defined in this directory.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When GRO/frag_list support was added to GSO, I made an error
which broke the support for segmenting linear GSO packets (GSO
packets are normally non-linear in the payload).
These days most of these packets are constructed by the tun
driver, which prefers to allocate linear memory if possible.
This is fixed in the latest kernel, but for 2.6.29 and earlier
it is still the norm.
Therefore this bug causes failures with GSO when used with tun
in 2.6.29.
Reported-by: James Huang <jamesclhuang@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove some pointless conditionals before kfree_skb().
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A long time ago we had bugs, primarily in TCP, where we would modify
skb->truesize (for TSO queue collapsing) in ways which would corrupt
the socket memory accounting.
skb_truesize_check() was added in order to try and catch this error
more systematically.
However this debugging check has morphed into a Frankenstein of sorts
and these days it does nothing other than catch false-positives.
Signed-off-by: David S. Miller <davem@davemloft.net>
The additional per-packet information (16 bytes for time stamps, 1
byte for flags) is stored for all packets in the skb_shared_info
struct. This implementation detail is hidden from users of that
information via skb_* accessor functions. A separate struct resp.
union is used for the additional information so that it can be
stored/copied easily outside of skb_shared_info.
Compared to previous implementations (reusing the tstamp field
depending on the context, optional additional structures) this
is the simplest solution. It does not extend sk_buff itself.
TX time stamping is implemented in software if the device driver
doesn't support hardware time stamping.
The new semantic for hardware/software time stamping around
ndo_start_xmit() is based on two assumptions about existing
network device drivers which don't support hardware time
stamping and know nothing about it:
- they leave the new skb_shared_tx unmodified
- the keep the connection to the originating socket in skb->sk
alive, i.e., don't call skb_orphan()
Given that skb_shared_tx is new, the first assumption is safe.
The second is only true for some drivers. As a result, software
TX time stamping currently works with the bnx2 driver, but not
with the unmodified igb driver (the two drivers this patch series
was tested with).
Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct page walking should be done with proper accessor functions, not
directly.
With doubts from David S. Miller and Herbert Xu.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/core/skbuff.c is a hodge-podge of symbol export placement.
Some of the exports are right after the definition of the
symbol being exported, others are clumped together into a big
group at the end of the file.
Make things consistent.
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous fix ad0f990444 (gro:
Fix handling of imprecisely split packets) only fixed the case
of frags merging, frag_list merging in the same circumstances
were still broken.
In particular, the packet headers end up in the data stream.
This patch fixes this plus another issue where an imprecisely
split packet header may be read incorrectly (this is mostly
harmless since it'll simply cause the packet to not match and
be rejected for GRO).
Thanks to Emil Tantilov and Jeff Kirsher for helping to track
this down.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The recent fix of data corruption when splicing from sockets uses
memory very inefficiently allocating a new page to copy each chunk of
linear part of skb. This patch uses the same page until it's full
(almost) by caching the page in sk_sndmsg_page field.
With changes from David S. Miller <davem@davemloft.net>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
gro: Do not merge paged packets into frag_list
Bigger is not always better :)
It was easy to continue to merged packets into frag_list after the
page array is full. However, this turns out to be worse than LRO
because frag_list is a much less efficient form of storage than the
page array. So we're better off stopping the merge and starting
a new entry with an empty page array.
In future we can optimise this further by doing frag_list merging
but making sure that we continue to fill in the page array.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unfortunately simplicity isn't always the best. The fraginfo
interface turned out to be suboptimal. The problem was quite
obvious. For every packet, we have to copy the headers from
the frags structure into skb->head, even though for 99% of the
packets this part is immediately thrown away after the merge.
LRO didn't have this problem because it directly read the headers
from the frags structure.
This patch attempts to address this by creating an interface
that allows GRO to access the headers in the first frag without
having to copy it. Because all drivers that use frags place the
headers in the first frag this optimisation should be enough.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
It oopsd for me in skb_seq_read. addr2line said it was
linux-2.6/net/core/skbuff.c:2228, which is this line:
while (st->frag_idx < skb_shinfo(st->cur_skb)->nr_frags) {
I added some printks in there and it looks like we hit this:
} else if (st->root_skb == st->cur_skb &&
skb_shinfo(st->root_skb)->frag_list) {
st->cur_skb = skb_shinfo(st->root_skb)->frag_list;
st->frag_idx = 0;
goto next_skb;
}
Actually I did some testing and added a few printks and found that the
st->cur_skb->data was 0 and hence the ptr used by iscsi_tcp was null.
This caused the kernel panic.
if (abs_offset < block_limit) {
- *data = st->cur_skb->data + abs_offset;
+ *data = st->cur_skb->data + (abs_offset - st->stepped_offset);
I enabled the debug_tcp and with a few printks found that the code did
not go to the next_skb label and could find that the sequence being
followed was this -
It hit this if condition -
if (st->cur_skb->next) {
st->cur_skb = st->cur_skb->next;
st->frag_idx = 0;
goto next_skb;
And so, now the st pointer is shifted to the next skb whereas actually
it should have hit the second else if first since the data is in the
frag_list.
else if (st->root_skb == st->cur_skb &&
skb_shinfo(st->root_skb)->frag_list) {
st->cur_skb = skb_shinfo(st->root_skb)->frag_list;
goto next_skb;
}
Reversing the two conditions the attached patch fixes the issue for me
on top of Herbert's patches.
Signed-off-by: Shyam Iyer <shyam_iyer@dell.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The frag_list handling was broken in skb_seq_read:
1) We didn't add the stepped offset when looking at the head
are of fragments other than the first.
2) We didn't take the stepped offset away when setting the data
pointer in the head area.
3) The frag index wasn't reset.
This patch fixes both issues.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous fix to paged packets broke the merging because it
reset the skb->len before we added it to the merged packet. This
wasn't detected because it simply resulted in the truncation of
the packet while the missing bit is subsequently retransmitted.
The fix is to store skb->len before we clobber it.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The trick in socket splicing where we try to convert the skb->data
into a page based reference using virt_to_page() does not work so
well.
The idea is to pass the virt_to_page() reference via the pipe
buffer, and refcount the buffer using a SKB reference.
But if we are splicing from a socket to a socket (via sendpage)
this doesn't work.
The from side processing will grab the page (and SKB) references.
The sendpage() calls will grab page references only, return, and
then the from side processing completes and drops the SKB ref.
The page based reference to skb->data is not enough to keep the
kmalloc() buffer backing it from being reused. Yet, that is
all that the socket send side has at this point.
This leads to data corruption if the skb->data buffer is reused
by SLAB before the send side socket actually gets the TX packet
out to the device.
The fix employed here is to simply allocate a page and copy the
skb->data bytes into that page.
This will hurt performance, but there is no clear way to fix this
properly without a copy at the present time, and it is important
to get rid of the data corruption.
With fixes from Herbert Xu.
Tested-by: Willy Tarreau <w@1wt.eu>
Foreseen-by: Changli Gao <xiaosuo@gmail.com>
Diagnosed-by: Willy Tarreau <w@1wt.eu>
Reported-by: Willy Tarreau <w@1wt.eu>
Fixed-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an skb with page frags is merged into an existing one, we
cannibalise its reference count. This is OK when the skb is
reused because we set nr_frags to zero in that case. However,
for the case where the skb is freed through kfree_skb, we didn't
clear nr_frags which causes the page to be freed prematurely.
This is fixed by moving the skb resetting into skb_gro_receive.
Reported-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows GRO to merge page frags (skb_shinfo(skb)->frags)
in one skb, rather than using the less efficient frag_list.
It also adds a new interface, napi_gro_frags to allow drivers
to inject page frags directly into the stack without allocating
an skb. This is intended to be the GRO equivalent for LRO's
lro_receive_frags interface.
The existing GSO interface can already handle page frags with
or without an appended frag_list so nothing needs to be changed
there.
The merging itself is rather simple. We store any new frag entries
after the last existing entry, without checking whether the first
new entry can be merged with the last existing entry. Making this
check would actually be easy but since no existing driver can
produce contiguous frags anyway it would just be mental masturbation.
If the total number of entries would exceed the capacity of a
single skb, we simply resort to using frag_list as we do now.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to allow GRO packets without frag_list at all, we need to
store the MSS in the packet itself. The obvious place is gso_size.
The only thing to watch out for is if the packet ends up not being
GRO then we need to clear gso_size before pushing the packet into
the stack.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>