In preparation for unifying the skb_frag and bio_vec, use the fine
accessors which already exist and use skb_frag_t instead of
struct skb_frag_struct.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
e1000 writes to doorbells to post transmit descriptors and fill the
receive ring. After writing descriptors to memory but before
writing to doorbells, use dma_wmb() rather than wmb(). wmb() is more
heavyweight than necessary for a device to see descriptor writes.
On x86, this avoids SFENCEs before doorbell writes in both the
Tx and Rx paths. On ARM, this converts DSB ST -> DMB OSHST.
Tested: 82576EB / x86; QEMU (qemu emulates an 8257x)
Signed-off-by: Venkatesh Srinivas <venkateshs@google.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Pull networking updates from David Miller:
"Highlights:
1) Support AES128-CCM ciphers in kTLS, from Vakul Garg.
2) Add fib_sync_mem to control the amount of dirty memory we allow to
queue up between synchronize RCU calls, from David Ahern.
3) Make flow classifier more lockless, from Vlad Buslov.
4) Add PHY downshift support to aquantia driver, from Heiner
Kallweit.
5) Add SKB cache for TCP rx and tx, from Eric Dumazet. This reduces
contention on SLAB spinlocks in heavy RPC workloads.
6) Partial GSO offload support in XFRM, from Boris Pismenny.
7) Add fast link down support to ethtool, from Heiner Kallweit.
8) Use siphash for IP ID generator, from Eric Dumazet.
9) Pull nexthops even further out from ipv4/ipv6 routes and FIB
entries, from David Ahern.
10) Move skb->xmit_more into a per-cpu variable, from Florian
Westphal.
11) Improve eBPF verifier speed and increase maximum program size,
from Alexei Starovoitov.
12) Eliminate per-bucket spinlocks in rhashtable, and instead use bit
spinlocks. From Neil Brown.
13) Allow tunneling with GUE encap in ipvs, from Jacky Hu.
14) Improve link partner cap detection in generic PHY code, from
Heiner Kallweit.
15) Add layer 2 encap support to bpf_skb_adjust_room(), from Alan
Maguire.
16) Remove SKB list implementation assumptions in SCTP, your's truly.
17) Various cleanups, optimizations, and simplifications in r8169
driver. From Heiner Kallweit.
18) Add memory accounting on TX and RX path of SCTP, from Xin Long.
19) Switch PHY drivers over to use dynamic featue detection, from
Heiner Kallweit.
20) Support flow steering without masking in dpaa2-eth, from Ioana
Ciocoi.
21) Implement ndo_get_devlink_port in netdevsim driver, from Jiri
Pirko.
22) Increase the strict parsing of current and future netlink
attributes, also export such policies to userspace. From Johannes
Berg.
23) Allow DSA tag drivers to be modular, from Andrew Lunn.
24) Remove legacy DSA probing support, also from Andrew Lunn.
25) Allow ll_temac driver to be used on non-x86 platforms, from Esben
Haabendal.
26) Add a generic tracepoint for TX queue timeouts to ease debugging,
from Cong Wang.
27) More indirect call optimizations, from Paolo Abeni"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1763 commits)
cxgb4: Fix error path in cxgb4_init_module
net: phy: improve pause mode reporting in phy_print_status
dt-bindings: net: Fix a typo in the phy-mode list for ethernet bindings
net: macb: Change interrupt and napi enable order in open
net: ll_temac: Improve error message on error IRQ
net/sched: remove block pointer from common offload structure
net: ethernet: support of_get_mac_address new ERR_PTR error
net: usb: smsc: fix warning reported by kbuild test robot
staging: octeon-ethernet: Fix of_get_mac_address ERR_PTR check
net: dsa: support of_get_mac_address new ERR_PTR error
net: dsa: sja1105: Fix status initialization in sja1105_get_ethtool_stats
vrf: sit mtu should not be updated when vrf netdev is the link
net: dsa: Fix error cleanup path in dsa_init_module
l2tp: Fix possible NULL pointer dereference
taprio: add null check on sched_nest to avoid potential null pointer dereference
net: mvpp2: cls: fix less than zero check on a u32 variable
net_sched: sch_fq: handle non connected flows
net_sched: sch_fq: do not assume EDT packets are ordered
net: hns3: use devm_kcalloc when allocating desc_cb
net: hns3: some cleanup for struct hns3_enet_ring
...
mmiowb() is now implied by spin_unlock() on architectures that require
it, so there is no reason to call it from driver code. This patch was
generated using coccinelle:
@mmiowb@
@@
- mmiowb();
and invoked as:
$ for d in drivers include/linux/qed sound; do \
spatch --include-headers --sp-file mmiowb.cocci --dir $d --in-place; done
NOTE: mmiowb() has only ever guaranteed ordering in conjunction with
spin_unlock(). However, pairing each mmiowb() removal in this patch with
the corresponding call to spin_unlock() is not at all trivial, so there
is a small chance that this change may regress any drivers incorrectly
relying on mmiowb() to order MMIO writes between CPUs using lock-free
synchronisation. If you've ended up bisecting to this commit, you can
reintroduce the mmiowb() calls using wmb() instead, which should restore
the old behaviour on all architectures other than some esoteric ia64
systems.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
There are two reasons for this.
First, the xmit_more flag conceptually doesn't fit into the skb, as
xmit_more is not a property related to the skb.
Its only a hint to the driver that the stack is about to transmit another
packet immediately.
Second, it was only done this way to not have to pass another argument
to ndo_start_xmit().
We can place xmit_more in the softnet data, next to the device recursion.
The recursion counter is already written to on each transmit. The "more"
indicator is placed right next to it.
Drivers can use the netdev_xmit_more() helper instead of skb->xmit_more
to check the "more packets coming" hint.
skb->xmit_more is retained (but always 0) to not cause build breakage.
This change takes care of the simple s/skb->xmit_more/netdev_xmit_more()/
conversions. Remaining drivers are converted in the next patches.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to comments in <linux/netdevice.h> we should return either >0
or -errno from ->ndo_set_features() if changing dev->features by itself.
Return 1 in such places to notify netdev_update_features() about applied
changes in dev->features.
Signed-off-by: Serhey Popovych <serhe.popovych@gmail.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
While reviewing code, I noticed that Eric Dumazet recommends that
drivers check the return code of napi_complete_done, and use that
to decide to enable interrupts or not when exiting poll. One of
the Intel drivers was already fixed (ixgbe).
Upon looking at the Intel drivers as a whole, we are handling our
polling and NAPI exit in a few different ways based on whether we
have multiqueue and whether we have Tx cleanup included. Several
drivers had the bug of exiting NAPI with return 0, which appears
to mess up the accounting in the stack.
Consolidate all the NAPI routines to do best known way of exiting
and to just mostly look like each other.
1) check return code of napi_complete_done to control interrupt enable
2) return the actual amount of work done.
3) return budget immediately if need NAPI poll again
Tested the changes on e1000e with a high interrupt rate set, and
it shows about an 8% reduction in the CPU utilization when busy
polling because we aren't re-enabling interrupts when we're about
to be polled.
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Fixes gcc '-Wunused-but-set-variable' warning:
drivers/net/ethernet/intel/e1000/e1000_main.c: In function 'e1000_watchdog':
drivers/net/ethernet/intel/e1000/e1000_main.c:2436:9: warning:
variable 'txb2b' set but not used [-Wunused-but-set-variable]
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We recently updated all our SPDX identifiers to correctly
indicate our net/ethernet/intel/* drivers were always released
and intended to be released under GPL v2, but the MODULE_LICENSE
declaration was never updated.
Fix the MODULE_LICENSE to be GPL v2, for all our drivers.
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
After many years of having a ~30 line copyright and license header to our
source files, we are finally able to reduce that to one line with the
advent of the SPDX identifier.
Also caught a few files missing the SPDX license identifier, so fixed
them up.
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the SPDX identifiers to all the Intel wired LAN driver files, as
outlined in Documentation/process/license-rules.rst.
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a race condition that can result into the interface being
up and carrier on, but with transmits disabled in the hardware.
The bug may show up by repeatedly IFF_DOWN+IFF_UP the interface, which
allows e1000_watchdog() interleave with e1000_down().
CPU x CPU y
--------------------------------------------------------------------
e1000_down():
netif_carrier_off()
e1000_watchdog():
if (carrier == off) {
netif_carrier_on();
enable_hw_transmit();
}
disable_hw_transmit();
e1000_watchdog():
/* carrier on, do nothing */
Signed-off-by: Vincenzo Maffione <v.maffione@gmail.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
A common pattern with skb_put() is to just want to memcpy()
some data into the new space, introduce skb_put_data() for
this.
An spatch similar to the one for skb_put_zero() converts many
of the places using it:
@@
identifier p, p2;
expression len, skb, data;
type t, t2;
@@
(
-p = skb_put(skb, len);
+p = skb_put_data(skb, data, len);
|
-p = (t)skb_put(skb, len);
+p = skb_put_data(skb, data, len);
)
(
p2 = (t2)p;
-memcpy(p2, data, len);
|
-memcpy(p, data, len);
)
@@
type t, t2;
identifier p, p2;
expression skb, data;
@@
t *p;
...
(
-p = skb_put(skb, sizeof(t));
+p = skb_put_data(skb, data, sizeof(t));
|
-p = (t *)skb_put(skb, sizeof(t));
+p = skb_put_data(skb, data, sizeof(t));
)
(
p2 = (t2)p;
-memcpy(p2, data, sizeof(*p));
|
-memcpy(p, data, sizeof(*p));
)
@@
expression skb, len, data;
@@
-memcpy(skb_put(skb, len), data, len);
+skb_put_data(skb, data, len);
(again, manually post-processed to retain some comments)
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e1000_get_stats() just returns dev->stats so we can leave it
out altogether and let dev_get_stats() do the job.
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
In commit 02cea39586 ("genirq: Provide disable_hardirq()")
Peter introduced disable_hardirq() for netpoll, but it is forgotten
to use it for e1000.
This patch changes disable_irq() to disable_hardirq() for e1000.
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Suggested-by: Sabrina Dubroca <sd@queasysnail.net>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
e100: min_mtu 68, max_mtu 1500
- remove e100_change_mtu entirely, is identical to old eth_change_mtu,
and no longer serves a purpose. No need to set min_mtu or max_mtu
explicitly, as ether_setup() will already set them to 68 and 1500.
e1000: min_mtu 46, max_mtu 16110
e1000e: min_mtu 68, max_mtu varies based on adapter
fm10k: min_mtu 68, max_mtu 15342
- remove fm10k_change_mtu entirely, does nothing now
i40e: min_mtu 68, max_mtu 9706
i40evf: min_mtu 68, max_mtu 9706
igb: min_mtu 68, max_mtu 9216
- There are two different "max" frame sizes claimed and both checked in
the driver, the larger value wasn't relevant though, so I've set max_mtu
to the smaller of the two values here to retain identical behavior.
igbvf: min_mtu 68, max_mtu 9216
- Same issue as igb duplicated
ixgb: min_mtu 68, max_mtu 16114
- Also remove pointless old == new check, as that's done in dev_set_mtu
ixgbe: min_mtu 68, max_mtu 9710
ixgbevf: min_mtu 68, max_mtu dependent on hardware/firmware
- Some hw can only handle up to max_mtu 1504 on a vf, others 9710
CC: netdev@vger.kernel.org
CC: intel-wired-lan@lists.osuosl.org
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Calling dev_close() causes IFF_UP to be cleared which will remove the
interfaces routes and some addresses. That's probably not what the user
intended when running the offline selftest. Besides this does not happen
if the interface is brought down before the test, so the current
behaviour is inconsistent.
Instead call the net_device_ops ndo_stop function directly and avoid
touching IFF_UP at all.
Signed-off-by: Stefan Assmann <sassmann@kpanic.de>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
The 82544 has code that adds one additional descriptor per data buffer.
However we weren't taking that into account when determining the descriptors
needed for the next transmit at the end of the xmit_frame path.
This change takes that into account by doubling the number of descriptors
needed for the 82544 so that we can avoid a potential issue where we could
hang the Tx ring by loading frames with xmit_more enabled and then stopping
the ring without writing the tail.
In addition it adds a few more descriptors to account for some additional
workarounds that have been added over time.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
The current code path is capable of grossly overestimating the number of
descriptors needed to transmit a new frame. This specifically occurs if
the skb contains a number of 4K pages. The issue is that the logic for
determining the descriptors needed is ((S) >> (X)) + 1. When X is 12 it
means that we were indicating that we required 2 descriptors for each 4K
page when we only needed one.
This change corrects this by instead adding (1 << (X)) - 1 to the S value
instead of adding 1 after the fact. This way we get an accurate descriptor
needed count as we are essentially doing a DIV_ROUNDUP().
Reported-by: Ivan Suzdal <isuzdal@mirantis.com>
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Use 'That' to replace 'The' so that the comment would make sense.
Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
The checking logic needed some clean-up work, so we rewrite it by
checking for break first. With that change in place, we can even move
the second check for goto statement outside of the loop.
As this is merely a cleanup, no functional change is involved. The
questionable 'tmp != 0xFF' is intentionally left alone.
Mark Rustad and Alexander Duyck contributed to this patch.
CC: Mark Rustad <mark.d.rustad@intel.com>
CC: Alex Duyck <aduyck@mirantis.com>
Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Janusz Wolak <januszvdm@gmail.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
e1000_clean_tx_irq cleans buffers and sets tx_ring->next_to_clean,
then e1000_xmit_frame reuses the cleaned buffers. But there are no
memory barriers when buffers gets recycled, so the recycled buffers
can be corrupted.
Use smp_store_release to update tx_ring->next_to_clean and
smp_load_acquire to read tx_ring->next_to_clean to properly
hand off buffers from e1000_clean_tx_irq to e1000_xmit_frame.
The data race was found with KernelThreadSanitizer (KTSAN).
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
As per Eric Dumazet's previous patches:
(see commit (24d2e4a507) - tg3: use napi_complete_done())
Quoting verbatim:
Using napi_complete_done() instead of napi_complete() allows
us to use /sys/class/net/ethX/gro_flush_timeout
GRO layer can aggregate more packets if the flush is delayed a bit,
without having to set too big coalescing parameters that impact
latencies.
</end quote>
Tested
configuration: low latency via ethtool -C ethx adaptive-rx off
rx-usecs 10 adaptive-tx off tx-usecs 15
workload: streaming rx using netperf TCP_MAERTS
igb:
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.1 () port 0 AF_INET : demo
...
Interim result: 941.48 10^6bits/s over 1.000 seconds ending at 1440193171.589
Alignment Offset Bytes Bytes Recvs Bytes Sends
Local Remote Local Remote Xfered Per Per
Recv Send Recv Send Recv (avg) Send (avg)
8 8 0 0 1176930056 1475.36 797726 16384.00 71905
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.1 () port 0 AF_INET : demo
...
Interim result: 941.49 10^6bits/s over 0.997 seconds ending at 1440193142.763
Alignment Offset Bytes Bytes Recvs Bytes Sends
Local Remote Local Remote Xfered Per Per
Recv Send Recv Send Recv (avg) Send (avg)
8 8 0 0 1175182320 50476.00 23282 16384.00 71816
i40e:
Hard to test because the traffic is incoming so fast (24Gb/s) that GRO
always receives 87kB, even at the highest interrupt rate.
Other drivers were only compile tested.
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change replaces calls to rmb with dma_rmb in the case where we want to
order all follow-on descriptor reads after the check for the descriptor
status bit.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To test a checkpatch spelling patch, I ran codespell against
drivers/net/ethernet/.
$ git ls-files drivers/net/ethernet/ | \
while read file ; do \
codespell -w $file; \
done
I removed a false positive in e1000_hw.h
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a race condition between e1000_change_mtu's cleanups and
netpoll, when we change the MTU across jumbo size:
Changing MTU frees all the rx buffers:
e1000_change_mtu -> e1000_down -> e1000_clean_all_rx_rings ->
e1000_clean_rx_ring
Then, close to the end of e1000_change_mtu:
pr_info -> ... -> netpoll_poll_dev -> e1000_clean ->
e1000_clean_rx_irq -> e1000_alloc_rx_buffers -> e1000_alloc_frag
And when we come back to do the rest of the MTU change:
e1000_up -> e1000_configure -> e1000_configure_rx ->
e1000_alloc_jumbo_rx_buffers
alloc_jumbo finds the buffers already != NULL, since data (shared with
page in e1000_rx_buffer->rxbuf) has been re-alloc'd, but it's garbage,
or at least not what is expected when in jumbo state.
This results in an unusable adapter (packets don't get through), and a
NULL pointer dereference on the next call to e1000_clean_rx_ring
(other mtu change, link down, shutdown):
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff81194d6e>] put_compound_page+0x7e/0x330
[...]
Call Trace:
[<ffffffff81195445>] put_page+0x55/0x60
[<ffffffff815d9f44>] e1000_clean_rx_ring+0x134/0x200
[<ffffffff815da055>] e1000_clean_all_rx_rings+0x45/0x60
[<ffffffff815df5e0>] e1000_down+0x1c0/0x1d0
[<ffffffff811e2260>] ? deactivate_slab+0x7f0/0x840
[<ffffffff815e21bc>] e1000_change_mtu+0xdc/0x170
[<ffffffff81647050>] dev_set_mtu+0xa0/0x140
[<ffffffff81664218>] do_setlink+0x218/0xac0
[<ffffffff814459e9>] ? nla_parse+0xb9/0x120
[<ffffffff816652d0>] rtnl_newlink+0x6d0/0x890
[<ffffffff8104f000>] ? kvm_clock_read+0x20/0x40
[<ffffffff810a2068>] ? sched_clock_cpu+0xa8/0x100
[<ffffffff81663802>] rtnetlink_rcv_msg+0x92/0x260
By setting the allocator to a dummy version, netpoll can't mess up our
rx buffers. The allocator is set back to a sane value in
e1000_configure_rx.
Fixes: edbbb3ca10 ("e1000: implement jumbo receive with partial descriptors")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
When bringing down an interface netif_carrier_off() should be
one the first things we do, since this will prevent the stack
from queuing more packets to this interface.
This operation is very fast, and should make the device behave
much nicer when trying to bring down an interface under load.
Also, this would Do The Right Thing (TM) if this device has some
sort of fail-over teaming and redirect traffic to the other IF.
Move netif_carrier_off as early as possible.
Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Don't update Tx tail descriptor if we queue hasn't been stopped and
we know at least one more skb will be sent right away.
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
The same macros are used for rx as well. So rename it.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change replaces calls to netdev_alloc_skb_ip_align with
napi_alloc_skb. The advantage of napi_alloc_skb is currently the fact that
the page allocation doesn't make use of any irq disable calls.
There are few spots where I couldn't replace the calls as the buffer
allocation routine is called as a part of init which is outside of the
softirq context.
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Update the Intel Ethernet drivers to use eth_skb_pad() and skb_put_padto
instead of doing their own implementations of the function.
Also this cleans up two other spots where skb_pad was called but the length
and tail pointers were being manipulated directly instead of just having
the padding length added via __skb_put.
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
VMWare's e1000 implementation does not seem to support unicast filtering.
This can be observed by configuring a macvlan interface on eth0 in a VM in
VMWare Fusion 5.0.5, and trying to use that interface instead of eth0.
Tested on 3.16.
Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
and remove *page, its only used for Rx.
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
e1000 uses the same metadata struct for Rx and Tx. But Tx and Rx have
different requirements.
For Rx, we only need to store a buffer and a DMA address.
Follow-up patch will remove skb for Rx, bringing rx_buffer_info down
to 16 bytes on x86_64.
[ buffer_info is 48 bytes ]
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Currently we unmap the DMA range, then copy to new skb.
Change this so we can keep the mapping in case the data is copied.
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Its the same in both handlers.
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
... and make it static.
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This device claims TSO and checksum support for vlans. It also
allows a user to control vlan acceleration offloading. As such,
it is possible to turn off vlan acceleration and configure a vlan
which will continue to support TSO.
In such situation the packet passed down the the device will contain
a vlan header and skb->protocol will be set to ETH_P_8021Q.
The device assumes that skb->protocol contains network protocol
value and uses that value to set up TSO and checksum information.
This will results in corrupted frames sent on the wire.
This patch extract the protocol value correctly and corrects TSO
for non-accelerated traffic.
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Jesse Brandeburg <jesse.brandeburg@intel.com>
CC: Bruce Allan <bruce.w.allan@intel.com>
CC: Carolyn Wyborny <carolyn.wyborny@intel.com>
CC: Don Skidmore <donald.c.skidmore@intel.com>
CC: Greg Rose <gregory.v.rose@intel.com>
CC: Alex Duyck <alexander.h.duyck@intel.com>
CC: John Ronciak <john.ronciak@intel.com>
CC: Mitch Williams <mitch.a.williams@intel.com>
CC: Linux NICS <linux.nics@intel.com>
CC: e1000-devel@lists.sourceforge.net
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We should prefer `struct pci_device_id` over `DEFINE_PCI_DEVICE_TABLE` to
meet kernel coding style guidelines. This issue was reported by checkpatch.
A simplified version of the semantic patch that makes this change is as
follows (http://coccinelle.lip6.fr/):
// <smpl>
@@
identifier i;
declarer name DEFINE_PCI_DEVICE_TABLE;
initializer z;
@@
- DEFINE_PCI_DEVICE_TABLE(i)
+ const struct pci_device_id i[]
= z;
// </smpl>
[bhelgaas: add semantic patch]
Signed-off-by: Benoit Taine <benoit.taine@lip6.fr>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
There is no case skb->len would be 0 or 'negative'.
Remove the check.
Signed-off-by: Yongjian Xu <xuyongjiande@gmail.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
On e1000_down(), we should ensure every asynchronous work is canceled
before proceeding. Since the watchdog_task can schedule other works
apart from itself, it should be stopped first, but currently it is
stopped after the reset_task. This can result in the following race
leading to the reset_task running after the module unload:
e1000_down_and_stop(): e1000_watchdog():
---------------------- -----------------
cancel_work_sync(reset_task)
schedule_work(reset_task)
cancel_delayed_work_sync(watchdog_task)
The patch moves cancel_delayed_work_sync(watchdog_task) at the beginning
of e1000_down_and_stop() thus ensuring the race is impossible.
Cc: Tushar Dave <tushar.n.dave@intel.com>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
The patch fixes the following lockdep warning, which is 100%
reproducible on network restart:
======================================================
[ INFO: possible circular locking dependency detected ]
3.12.0+ #47 Tainted: GF
-------------------------------------------------------
kworker/1:1/27 is trying to acquire lock:
((&(&adapter->watchdog_task)->work)){+.+...}, at: [<ffffffff8108a5b0>] flush_work+0x0/0x70
but task is already holding lock:
(&adapter->mutex){+.+...}, at: [<ffffffffa0177c0a>] e1000_reset_task+0x4a/0xa0 [e1000]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&adapter->mutex){+.+...}:
[<ffffffff810bdb5d>] lock_acquire+0x9d/0x120
[<ffffffff816b8cbc>] mutex_lock_nested+0x4c/0x390
[<ffffffffa017233d>] e1000_watchdog+0x7d/0x5b0 [e1000]
[<ffffffff8108b972>] process_one_work+0x1d2/0x510
[<ffffffff8108ca80>] worker_thread+0x120/0x3a0
[<ffffffff81092c1e>] kthread+0xee/0x110
[<ffffffff816c3d7c>] ret_from_fork+0x7c/0xb0
-> #0 ((&(&adapter->watchdog_task)->work)){+.+...}:
[<ffffffff810bd9c0>] __lock_acquire+0x1710/0x1810
[<ffffffff810bdb5d>] lock_acquire+0x9d/0x120
[<ffffffff8108a5eb>] flush_work+0x3b/0x70
[<ffffffff8108b5d8>] __cancel_work_timer+0x98/0x140
[<ffffffff8108b693>] cancel_delayed_work_sync+0x13/0x20
[<ffffffffa0170cec>] e1000_down_and_stop+0x3c/0x60 [e1000]
[<ffffffffa01775b1>] e1000_down+0x131/0x220 [e1000]
[<ffffffffa0177c12>] e1000_reset_task+0x52/0xa0 [e1000]
[<ffffffff8108b972>] process_one_work+0x1d2/0x510
[<ffffffff8108ca80>] worker_thread+0x120/0x3a0
[<ffffffff81092c1e>] kthread+0xee/0x110
[<ffffffff816c3d7c>] ret_from_fork+0x7c/0xb0
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&adapter->mutex);
lock((&(&adapter->watchdog_task)->work));
lock(&adapter->mutex);
lock((&(&adapter->watchdog_task)->work));
*** DEADLOCK ***
3 locks held by kworker/1:1/27:
#0: (events){.+.+.+}, at: [<ffffffff8108b906>] process_one_work+0x166/0x510
#1: ((&adapter->reset_task)){+.+...}, at: [<ffffffff8108b906>] process_one_work+0x166/0x510
#2: (&adapter->mutex){+.+...}, at: [<ffffffffa0177c0a>] e1000_reset_task+0x4a/0xa0 [e1000]
stack backtrace:
CPU: 1 PID: 27 Comm: kworker/1:1 Tainted: GF 3.12.0+ #47
Hardware name: System manufacturer System Product Name/P5B-VM SE, BIOS 0501 05/31/2007
Workqueue: events e1000_reset_task [e1000]
ffffffff820f6000 ffff88007b9dba98 ffffffff816b54a2 0000000000000002
ffffffff820f5e50 ffff88007b9dbae8 ffffffff810ba936 ffff88007b9dbac8
ffff88007b9dbb48 ffff88007b9d8f00 ffff88007b9d8780 ffff88007b9d8f00
Call Trace:
[<ffffffff816b54a2>] dump_stack+0x49/0x5f
[<ffffffff810ba936>] print_circular_bug+0x216/0x310
[<ffffffff810bd9c0>] __lock_acquire+0x1710/0x1810
[<ffffffff8108a5b0>] ? __flush_work+0x250/0x250
[<ffffffff810bdb5d>] lock_acquire+0x9d/0x120
[<ffffffff8108a5b0>] ? __flush_work+0x250/0x250
[<ffffffff8108a5eb>] flush_work+0x3b/0x70
[<ffffffff8108a5b0>] ? __flush_work+0x250/0x250
[<ffffffff8108b5d8>] __cancel_work_timer+0x98/0x140
[<ffffffff8108b693>] cancel_delayed_work_sync+0x13/0x20
[<ffffffffa0170cec>] e1000_down_and_stop+0x3c/0x60 [e1000]
[<ffffffffa01775b1>] e1000_down+0x131/0x220 [e1000]
[<ffffffffa0177c12>] e1000_reset_task+0x52/0xa0 [e1000]
[<ffffffff8108b972>] process_one_work+0x1d2/0x510
[<ffffffff8108b906>] ? process_one_work+0x166/0x510
[<ffffffff8108ca80>] worker_thread+0x120/0x3a0
[<ffffffff8108c960>] ? manage_workers+0x2c0/0x2c0
[<ffffffff81092c1e>] kthread+0xee/0x110
[<ffffffff81092b30>] ? __init_kthread_worker+0x70/0x70
[<ffffffff816c3d7c>] ret_from_fork+0x7c/0xb0
[<ffffffff81092b30>] ? __init_kthread_worker+0x70/0x70
== The issue background ==
The problem occurs, because e1000_down(), which is called under
adapter->mutex by e1000_reset_task(), tries to synchronously cancel
e1000 auxiliary works (reset_task, watchdog_task, phy_info_task,
fifo_stall_task), which take adapter->mutex in their handlers. So the
question is what does adapter->mutex protect there?
The adapter->mutex was introduced by commit 0ef4ee ("e1000: convert to
private mutex from rtnl") as a replacement for rtnl_lock() taken in the
asynchronous handlers. It targeted on fixing a similar lockdep warning
issued when e1000_down() was called under rtnl_lock(), and it fixed it,
but unfortunately it introduced the lockdep warning described above.
Anyway, that said the source of this bug is that the asynchronous works
were made to take rtnl_lock() some time ago, so let's look deeper and
find why it was added there.
The rtnl_lock() was added to asynchronous handlers by commit 338c15
("e1000: fix occasional panic on unload") in order to prevent
asynchronous handlers from execution after the module is unloaded
(e1000_down() is called) as it follows from the comment to the commit:
> Net drivers in general have an issue where timers fired
> by mod_timer or work threads with schedule_work are running
> outside of the rtnl_lock.
>
> With no other lock protection these routines are vulnerable
> to races with driver unload or reset paths.
>
> The longer term solution to this might be a redesign with
> safer locks being taken in the driver to guarantee no
> reentrance, but for now a safe and effective fix is
> to take the rtnl_lock in these routines.
I'm not sure if this locking scheme fixed the problem or just made it
unlikely, although I incline to the latter. Anyway, this was long time
ago when e1000 auxiliary works were implemented as timers scheduling
real work handlers in their routines. The e1000_down() function only
canceled the timers, but left the real handlers running if they were
running, which could result in work execution after module unload.
Today, the e1000 driver uses sane delayed works instead of the pair
timer+work to implement its delayed asynchronous handlers, and the
e1000_down() synchronously cancels all the works so that the problem
that commit 338c15 tried to cope with disappeared, and we don't need any
locks in the handlers any more. Moreover, any locking there can
potentially result in a deadlock.
So, this patch reverts commits 0ef4ee and 338c15.
Fixes: 0ef4eedc2e ("e1000: convert to private mutex from rtnl")
Fixes: 338c15e470 ("e1000: fix occasional panic on unload")
Cc: Tushar Dave <tushar.n.dave@intel.com>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This change is based on a similar change made to e1000e support in
commit bb9e44d0d0 ("e1000e: prevent oops when adapter is being closed
and reset simultaneously"). The same issue has also been observed
on the older e1000 cards.
Here, we have increased the RESET_COUNT value to 50 because there are too
many accesses to e1000 nic on stress tests to e1000 nic, it is not enough
to set RESET_COUT 25. Experimentation has shown that it is enough to set
RESET_COUNT 50.
Signed-off-by: yzhu1 <yanjun.zhu@windriver.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>