Replace two switch statements enumerating all valid ptp classes with an if
statement matching for not PTP_CLASS_NONE.
Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The if_lock()/if_unlock() in next_to_run() adds a significant
overhead, because its called for every packet in busy loop of
pktgen_thread_worker(). (Thomas Graf originally pointed me
at this lock problem).
Removing these two "LOCK" operations should in theory save us approx
16ns (8ns x 2), as illustrated below we do save 16ns when removing
the locks and introducing RCU protection.
Performance data with CLONE_SKB==100000, TX-size=512, rx-usecs=30:
(single CPU performance, ixgbe 10Gbit/s, E5-2630)
* Prev : 5684009 pps --> 175.93ns (1/5684009*10^9)
* RCU-fix: 6272204 pps --> 159.43ns (1/6272204*10^9)
* Diff : +588195 pps --> -16.50ns
To understand this RCU patch, I describe the pktgen thread model
below.
In pktgen there is several kernel threads, but there is only one CPU
running each kernel thread. Communication with the kernel threads are
done through some thread control flags. This allow the thread to
change data structures at a know synchronization point, see main
thread func pktgen_thread_worker().
Userspace changes are communicated through proc-file writes. There
are three types of changes, general control changes "pgctrl"
(func:pgctrl_write), thread changes "kpktgend_X"
(func:pktgen_thread_write), and interface config changes "etcX@N"
(func:pktgen_if_write).
Userspace "pgctrl" and "thread" changes are synchronized via the mutex
pktgen_thread_lock, thus only a single userspace instance can run.
The mutex is taken while the packet generator is running, by pgctrl
"start". Thus e.g. "add_device" cannot be invoked when pktgen is
running/started.
All "pgctrl" and all "thread" changes, except thread "add_device",
communicate via the thread control flags. The main problem is the
exception "add_device", that modifies threads "if_list" directly.
Fortunately "add_device" cannot be invoked while pktgen is running.
But there exists a race between "rem_device_all" and "add_device"
(which normally don't occur, because "rem_device_all" waits 125ms
before returning). Background'ing "rem_device_all" and running
"add_device" immediately allow the race to occur.
The race affects the threads (list of devices) "if_list". The if_lock
is used for protecting this "if_list". Other readers are given
lock-free access to the list under RCU read sections.
Note, interface config changes (via proc) can occur while pktgen is
running, which worries me a bit. I'm assuming proc_remove() takes
appropriate locks, to assure no writers exists after proc_remove()
finish.
I've been running a script exercising the race condition (leading me
to fix the proc_remove order), without any issues. The script also
exercises concurrent proc writes, while the interface config is
getting removed.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid calling set_current_state() inside the busy-loop in
pktgen_thread_worker(). In case of pkt_dev->delay, then it is still
used/enabled in pktgen_xmit() via the spin() call.
The set_current_state(TASK_INTERRUPTIBLE) uses a xchg, which implicit
is LOCK prefixed. I've measured the asm LOCK operation to take approx
8ns on this E5-2630 CPU. Performance increase corrolate with this
measurement.
Performance data with CLONE_SKB==100000, rx-usecs=30:
(single CPU performance, ixgbe 10Gbit/s, E5-2630)
* Prev: 5454050 pps --> 183.35ns (1/5454050*10^9)
* Now: 5684009 pps --> 175.93ns (1/5684009*10^9)
* Diff: +229959 pps --> -7.42ns
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So far, it is assumed that ops->setup is filled up. But there might be
case that ops might make sense even without ->setup. In that case,
forbid to newlink and dellink.
This allows to register simple rtnl link ops containing only ->kind.
That allows consistent way of passing device kind (either device-kind or
slave-kind) to userspace.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 371121057607e3127e19b3fa094330181b5b031e("net:
QDISC_STATE_RUNNING dont need atomic bit ops") the
__QDISC_STATE_RUNNING is renamed to __QDISC___STATE_RUNNING,
but the old names existing in comment are not replaced with
the new name completely.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull SCSI target fixes from Nicholas Bellinger:
"Mostly minor fixes this time around. The highlights include:
- iscsi-target CHAP authentication fixes to enforce explicit key
values (Tejas Vaykole + rahul.rane)
- fix a long-standing OOPs in target-core when a alua configfs
attribute is accessed after port symlink has been removed.
(Sebastian Herbszt)
- fix a v3.10.y iscsi-target regression causing the login reject
status class/detail to be ignored (Christoph Vu-Brugier)
- fix a v3.10.y iscsi-target regression to avoid rejecting an
existing ITT during Data-Out when data-direction is wrong (Santosh
Kulkarni + Arshad Hussain)
- fix a iscsi-target related shutdown deadlock on UP kernels (Mikulas
Patocka)
- fix a v3.16-rc1 build issue with vhost-scsi + !CONFIG_NET (MST)"
* git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
iscsi-target: fix iscsit_del_np deadlock on unload
iovec: move memcpy_from/toiovecend to lib/iovec.c
iscsi-target: Avoid rejecting incorrect ITT for Data-Out
tcm_loop: Fix memory leak in tcm_loop_submission_work error path
iscsi-target: Explicily clear login response PDU in exception path
target: Fix left-over se_lun->lun_sep pointer OOPs
iscsi-target; Enforce 1024 byte maximum for CHAP_C key value
iscsi-target: Convert chap_server_compute_md5 to use kstrtoul
ERROR: "memcpy_fromiovecend" [drivers/vhost/vhost_scsi.ko] undefined!
commit 9f977ef7b6
vhost-scsi: Include prot_bytes into expected data transfer length
in target-pending makes drivers/vhost/scsi.c call memcpy_fromiovecend().
This function is not available when CONFIG_NET is not enabled.
socket.h already includes uio.h, so no callers need updating.
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Dave Jones reported that a crash is occurring in
csum_partial
tcp_gso_segment
inet_gso_segment
? update_dl_migration
skb_mac_gso_segment
__skb_gso_segment
dev_hard_start_xmit
sch_direct_xmit
__dev_queue_xmit
? dev_hard_start_xmit
dev_queue_xmit
ip_finish_output
? ip_output
ip_output
ip_forward_finish
ip_forward
ip_rcv_finish
ip_rcv
__netif_receive_skb_core
? __netif_receive_skb_core
? trace_hardirqs_on
__netif_receive_skb
netif_receive_skb_internal
napi_gro_complete
? napi_gro_complete
dev_gro_receive
? dev_gro_receive
napi_gro_receive
It looks like a likely culprit is that SKB_GSO_CB()->csum_start is
not set correctly when doing non-scatter gather. We are using
offset as opposed to doffset.
Reported-by: Dave Jones <davej@redhat.com>
Tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 7e2b10c1e5 ("net: Support for multiple checksums with gso")
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When IP route cache had been removed in linux-3.6, we broke assumption
that dst entries were all freed after rcu grace period. DST_NOCACHE
dst were supposed to be freed from dst_release(). But it appears
we want to keep such dst around, either in UDP sockets or tunnels.
In sk_dst_get() we need to make sure dst refcount is not 0
before incrementing it, or else we might end up freeing a dst
twice.
DST_NOCACHE set on a dst does not mean this dst can not be attached
to a socket or a tunnel.
Then, before actual freeing, we need to observe a rcu grace period
to make sure all other cpus can catch the fact the dst is no longer
usable.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dormando <dormando@rydia.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use kcalloc/kmalloc_array to make it clear we're allocating arrays. No
integer overflow can actually happen here, since len/flen is guaranteed
to be less than BPF_MAXINSNS (4096). However, this changed makes sure
we're not going to get one if BPF_MAXINSNS were ever increased.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the order of the parameters to sk_unattached_filter_create() in
the kerneldoc to reflect the order they appear in the actual function.
This fix is only cosmetic, in the generated doc they still appear in the
correct order without the fix.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It seems overkill to use vmalloc() for typical listeners with less than
2048 hash buckets. Try kmalloc() and fallback to vmalloc() to reduce TLB
pressure.
Use kvfree() helper as it is now available.
Use ilog2() instead of a loop.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_flow_dissect() dissects only transport header type in ip_proto. It dose not
give any information about IPv4 or IPv6.
This patch adds new member, n_proto, to struct flow_keys. Which records the
IP layer type. i.e IPv4 or IPv6.
This can be used in netdev->ndo_rx_flow_steer driver function to dissect flow.
Adding new member to flow_keys increases the struct size by around 4 bytes.
This causes BUILD_BUG_ON(sizeof(qcb->data) < sz); to fail in
qdisc_cb_private_validate()
So increase data size by 4
Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The original checks (via sk_chk_filter) for instruction count uses ">",
not ">=", so changing this in sk_convert_filter has the potential to break
existing seccomp filters that used exactly BPF_MAXINSNS many instructions.
Fixes: bd4cf0ed33 ("net: filter: rework/optimize internal BPF interpreter's instruction set")
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org # v3.15+
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Return the actual error code if call kset_create_and_add() failed
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In __dev_open(), it already calls dev_set_rx_mode().
and dev_set_rx_mode() has no effect for a net device which does not have
IFF_UP flag set.
So the call of dev_set_rx_mode() is duplicate in __dev_change_flags().
Signed-off-by: Weiping Pan <panweiping3@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Geert reported issues regarding checksum complete and UDP.
The logic introduced in commit 7e3cead517
("net: Save software checksum complete") is not correct.
This patch:
1) Restores code in __skb_checksum_complete_header except for setting
CHECKSUM_UNNECESSARY. This function may be calculating checksum on
something less than skb->len.
2) Adds saving checksum to __skb_checksum_complete. The full packet
checksum 0..skb->len is calculated without adding in pseudo header.
This value is saved in skb->csum and then the pseudo header is added
to that to derive the checksum for validation.
3) In both __skb_checksum_complete_header and __skb_checksum_complete,
set skb->csum_valid to whether checksum of zero was computed. This
allows skb_csum_unnecessary to return true without changing to
CHECKSUM_UNNECESSARY which was done previously.
4) Copy new csum related bits in __copy_skb_header.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov.
2) Multiqueue support in xen-netback and xen-netfront, from Andrew J
Benniston.
3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn
Mork.
4) BPF now has a "random" opcode, from Chema Gonzalez.
5) Add more BPF documentation and improve test framework, from Daniel
Borkmann.
6) Support TCP fastopen over ipv6, from Daniel Lee.
7) Add software TSO helper functions and use them to support software
TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia.
8) Support software TSO in fec driver too, from Nimrod Andy.
9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli.
10) Handle broadcasts more gracefully over macvlan when there are large
numbers of interfaces configured, from Herbert Xu.
11) Allow more control over fwmark used for non-socket based responses,
from Lorenzo Colitti.
12) Do TCP congestion window limiting based upon measurements, from Neal
Cardwell.
13) Support busy polling in SCTP, from Neal Horman.
14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru.
15) Bridge promisc mode handling improvements from Vlad Yasevich.
16) Don't use inetpeer entries to implement ID generation any more, it
performs poorly, from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits)
rtnetlink: fix userspace API breakage for iproute2 < v3.9.0
tcp: fixing TLP's FIN recovery
net: fec: Add software TSO support
net: fec: Add Scatter/gather support
net: fec: Increase buffer descriptor entry number
net: fec: Factorize feature setting
net: fec: Enable IP header hardware checksum
net: fec: Factorize the .xmit transmit function
bridge: fix compile error when compiling without IPv6 support
bridge: fix smatch warning / potential null pointer dereference
via-rhine: fix full-duplex with autoneg disable
bnx2x: Enlarge the dorq threshold for VFs
bnx2x: Check for UNDI in uncommon branch
bnx2x: Fix 1G-baseT link
bnx2x: Fix link for KR with swapped polarity lane
sctp: Fix sk_ack_backlog wrap-around problem
net/core: Add VF link state control policy
net/fsl: xgmac_mdio is dependent on OF_MDIO
net/fsl: Make xgmac_mdio read error message useful
net_sched: drr: warn when qdisc is not work conserving
...
When running RHEL6 userspace on a current upstream kernel, "ip link"
fails to show VF information.
The reason is a kernel<->userspace API change introduced by commit
88c5b5ce5c ("rtnetlink: Call nlmsg_parse() with correct header length"),
after which the kernel does not see iproute2's IFLA_EXT_MASK attribute
in the netlink request.
iproute2 adjusted for the API change in its commit 63338dca4513
("libnetlink: Use ifinfomsg instead of rtgenmsg in rtnl_wilddump_req_filter").
The problem has been noticed before:
http://marc.info/?l=linux-netdev&m=136692296022182&w=2
(Subject: Re: getting VF link info seems to be broken in 3.9-rc8)
We can do better than tell those with old userspace to upgrade. We can
recognize the old iproute2 in the kernel by checking the netlink message
length. Even when including the IFLA_EXT_MASK attribute, its netlink
message is shorter than struct ifinfomsg.
With this patch "ip link" shows VF information in both old and new
iproute2 versions.
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
net/core/rtnetlink.c
net/core/skbuff.c
Both conflicts were very simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 1d8faf48c7 (net/core: Add VF link state control) added VF link state
control to the netlink VF nested structure, but failed to add a proper entry
for the new structure into the VF policy table. Add the missing entry so
the table and the actual data copied into the netlink nested struct are in
sync.
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In skb_checksum complete, if we need to compute the checksum for the
packet (via skb_checksum) save the result as CHECKSUM_COMPLETE.
Subsequent checksum verification can use this.
Also, added csum_complete_sw flag to distinguish between software and
hardware generated checksum complete, we should always be able to trust
the software computation.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are several instances where a pskb_copy or __pskb_copy is
immediately followed by an skb_clone.
Add a couple of new functions to allow the copy skb to be allocated
from the fclone cache and thus speed up subsequent skb_clone calls.
Cc: Alexander Smirnov <alex.bluesman.smirnov@gmail.com>
Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Cc: Marek Lindner <mareklindner@neomailbox.ch>
Cc: Simon Wunderlich <sw@simonwunderlich.de>
Cc: Antonio Quartulli <antonio@meshcoding.com>
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Gustavo Padovan <gustavo@padovan.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Arvid Brodin <arvid.brodin@alten.se>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: Lauro Ramos Venancio <lauro.venancio@openbossa.org>
Cc: Aloisio Almeida Jr <aloisio.almeida@openbossa.org>
Cc: Samuel Ortiz <sameo@linux.intel.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Allan Stephens <allan.stephens@windriver.com>
Cc: Andrew Hendry <andrew.hendry@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Reviewed-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fix compiler warning on 32-bit architectures:
net/core/filter.c: In function '__sk_run_filter':
net/core/filter.c:540:22: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
net/core/filter.c:550:22: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
net/core/filter.c:560:22: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a kernel BUG_ON in skb_segment. It is hit when
testing two VMs on openvswitch with one VM acting as VXLAN gateway.
During VXLAN packet GSO, skb_segment is called with skb->data
pointing to inner TCP payload. skb_segment calls skb_network_protocol
to retrieve the inner protocol. skb_network_protocol actually expects
skb->data to point to MAC and it calls pskb_may_pull with ETH_HLEN.
This ends up pulling in ETH_HLEN data from header tail. As a result,
pskb_trim logic is skipped and BUG_ON is hit later.
Move skb_push in front of skb_network_protocol so that skb->data
lines up properly.
kernel BUG at net/core/skbuff.c:2999!
Call Trace:
[<ffffffff816ac412>] tcp_gso_segment+0x122/0x410
[<ffffffff816bc74c>] inet_gso_segment+0x13c/0x390
[<ffffffff8164b39b>] skb_mac_gso_segment+0x9b/0x170
[<ffffffff816b3658>] skb_udp_tunnel_segment+0xd8/0x390
[<ffffffff816b3c00>] udp4_ufo_fragment+0x120/0x140
[<ffffffff816bc74c>] inet_gso_segment+0x13c/0x390
[<ffffffff8109d742>] ? default_wake_function+0x12/0x20
[<ffffffff8164b39b>] skb_mac_gso_segment+0x9b/0x170
[<ffffffff8164b4d0>] __skb_gso_segment+0x60/0xc0
[<ffffffff8164b6b3>] dev_hard_start_xmit+0x183/0x550
[<ffffffff8166c91e>] sch_direct_xmit+0xfe/0x1d0
[<ffffffff8164bc94>] __dev_queue_xmit+0x214/0x4f0
[<ffffffff8164bf90>] dev_queue_xmit+0x10/0x20
[<ffffffff81687edb>] ip_finish_output+0x66b/0x890
[<ffffffff81688a58>] ip_output+0x58/0x90
[<ffffffff816c628f>] ? fib_table_lookup+0x29f/0x350
[<ffffffff816881c9>] ip_local_out_sk+0x39/0x50
[<ffffffff816cbfad>] iptunnel_xmit+0x10d/0x130
[<ffffffffa0212200>] vxlan_xmit_skb+0x1d0/0x330 [vxlan]
[<ffffffffa02a3919>] vxlan_tnl_send+0x129/0x1a0 [openvswitch]
[<ffffffffa02a2cd6>] ovs_vport_send+0x26/0xa0 [openvswitch]
[<ffffffffa029931e>] do_output+0x2e/0x50 [openvswitch]
Signed-off-by: Wei-Chun Chao <weichunc@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The macro 'A' used in internal BPF interpreter:
#define A regs[insn->a_reg]
was easily confused with the name of classic BPF register 'A', since
'A' would mean two different things depending on context.
This patch is trying to clean up the naming and clarify its usage in the
following way:
- A and X are names of two classic BPF registers
- BPF_REG_A denotes internal BPF register R0 used to map classic register A
in internal BPF programs generated from classic
- BPF_REG_X denotes internal BPF register R7 used to map classic register X
in internal BPF programs generated from classic
- internal BPF instruction format:
struct sock_filter_int {
__u8 code; /* opcode */
__u8 dst_reg:4; /* dest register */
__u8 src_reg:4; /* source register */
__s16 off; /* signed offset */
__s32 imm; /* signed immediate constant */
};
- BPF_X/BPF_K is 1 bit used to encode source operand of instruction
In classic:
BPF_X - means use register X as source operand
BPF_K - means use 32-bit immediate as source operand
In internal:
BPF_X - means use 'src_reg' register as source operand
BPF_K - means use 32-bit immediate as source operand
Suggested-by: Chema Gonzalez <chema@google.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Chema Gonzalez <chema@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull cgroup updates from Tejun Heo:
"A lot of activities on cgroup side. Heavy restructuring including
locking simplification took place to improve the code base and enable
implementation of the unified hierarchy, which currently exists behind
a __DEVEL__ mount option. The core support is mostly complete but
individual controllers need further work. To explain the design and
rationales of the the unified hierarchy
Documentation/cgroups/unified-hierarchy.txt
is added.
Another notable change is css (cgroup_subsys_state - what each
controller uses to identify and interact with a cgroup) iteration
update. This is part of continuing updates on css object lifetime and
visibility. cgroup started with reference count draining on removal
way back and is now reaching a point where csses behave and are
iterated like normal refcnted objects albeit with some complexities to
allow distinguishing the state where they're being deleted. The css
iteration update isn't taken advantage of yet but is planned to be
used to simplify memcg significantly"
* 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
cgroup: disallow disabled controllers on the default hierarchy
cgroup: don't destroy the default root
cgroup: disallow debug controller on the default hierarchy
cgroup: clean up MAINTAINERS entries
cgroup: implement css_tryget()
device_cgroup: use css_has_online_children() instead of has_children()
cgroup: convert cgroup_has_live_children() into css_has_online_children()
cgroup: use CSS_ONLINE instead of CGRP_DEAD
cgroup: iterate cgroup_subsys_states directly
cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
cgroup: move cgroup->serial_nr into cgroup_subsys_state
cgroup: link all cgroup_subsys_states in their sibling lists
cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
cgroup: remove cgroup->parent
device_cgroup: remove direct access to cgroup->children
memcg: update memcg_has_children() to use css_next_child()
memcg: remove tasks/children test from mem_cgroup_force_empty()
cgroup: remove css_parent()
cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
cgroup: use cgroup->self.refcnt for cgroup refcnting
...
unregister_netdevice_many() API is error prone and we had too
many bugs because of dangling LIST_HEAD on stacks.
See commit f87e6f4793 ("net: dont leave active on stack LIST_HEAD")
In fact, instead of making sure no caller leaves an active list_head,
just force a list_del() in the callee. No one seems to need to access
the list after unregister_netdevice_many()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that 3.15 is released, this merges the 'next' branch into 'master',
bringing us to the normal situation where my 'master' branch is the
merge window.
* accumulated work in next: (6809 commits)
ufs: sb mutex merge + mutex_destroy
powerpc: update comments for generic idle conversion
cris: update comments for generic idle conversion
idle: remove cpu_idle() forward declarations
nbd: zero from and len fields in NBD_CMD_DISCONNECT.
mm: convert some level-less printks to pr_*
MAINTAINERS: adi-buildroot-devel is moderated
MAINTAINERS: add linux-api for review of API/ABI changes
mm/kmemleak-test.c: use pr_fmt for logging
fs/dlm/debug_fs.c: replace seq_printf by seq_puts
fs/dlm/lockspace.c: convert simple_str to kstr
fs/dlm/config.c: convert simple_str to kstr
mm: mark remap_file_pages() syscall as deprecated
mm: memcontrol: remove unnecessary memcg argument from soft limit functions
mm: memcontrol: clean up memcg zoneinfo lookup
mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
mm/mempool.c: update the kmemleak stack trace for mempool allocations
lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
mm: introduce kmemleak_update_trace()
mm/kmemleak.c: use %u to print ->checksum
...
Conflicts:
drivers/net/xen-netback/netback.c
net/core/filter.c
A filter bug fix overlapped some cleanups and a conversion
over to some new insn generation macros.
A xen-netback bug fix overlapped the addition of multi-queue
support.
Signed-off-by: David S. Miller <davem@davemloft.net>
BPF classic->internal converter broke SKF_AD_PKTTYPE extension, since
pkt_type_offset() was failing to find skb->pkt_type field which is defined as:
__u8 pkt_type:3,
fclone:2,
ipvs_property:1,
peeked:1,
nf_trace:1;
Fix it by searching for 3 most significant bits and shift them by 5 at run-time
Fixes: bd4cf0ed33 ("net: filter: rework/optimize internal BPF interpreter's instruction set")
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Tested-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If an MPLS packet requires segmentation then use mpls_features
to determine if the software implementation should be used.
As no driver advertises MPLS GSO segmentation this will always be
the case.
I had not noticed that this was necessary before as software MPLS GSO
segmentation was already being used in my test environment. I believe that
the reason for that is the skbs in question always had fragments and the
driver I used does not advertise NETIF_F_FRAGLIST (which seems to be the
case for most drivers). Thus software segmentation was activated by
skb_gso_ok().
This introduces the overhead of an extra call to skb_network_protocol()
in the case where where CONFIG_NET_MPLS_GSO is set and
skb->ip_summed == CHECKSUM_NONE.
Thanks to Jesse Gross for prompting me to investigate this.
Signed-off-by: Simon Horman <horms@verge.net.au>
Acked-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is available since v3.15-rc5.
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When creating a GSO packet segment we may need to set more than
one checksum in the packet (for instance a TCP checksum and
UDP checksum for VXLAN encapsulation). To be efficient, we want
to do checksum calculation for any part of the packet at most once.
This patch adds csum_start offset to skb_gso_cb. This tracks the
starting offset for skb->csum which is initially set in skb_segment.
When a protocol needs to compute a transport checksum it calls
gso_make_checksum which computes the checksum value from the start
of transport header to csum_start and then adds in skb->csum to get
the full checksum. skb->csum and csum_start are then updated to reflect
the checksum of the resultant packet starting from the transport header.
This patch also adds a flag to skbuff, encap_hdr_csum, which is set
in *gso_segment fucntions to indicate that a tunnel protocol needs
checksum calculation
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
include/net/inetpeer.h
net/ipv6/output_core.c
Changes in net were fixing bugs in code removed in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
When we jump to free_pcpu on failure in alloc_netdev_mqs()
rx and tx queues are not yet allocated, so no need to free them.
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is possible that ->newlink() fails before registering
the device, in this case we should just free it, it's
safe to call free_netdev().
Fixes: commit 0e0eee2465 (net: correct error path in rtnl_newlink())
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ben Hutchings says:
====================
Pull request: Fixes for new ethtool RSS commands
This addresses several problems I previously identified with the new
ETHTOOL_{G,S}RSSH commands:
1. Missing validation of reserved parameters
2. Vague documentation
3. Use of unnamed magic number
4. No consolidation with existing driver operations
I don't currently have access to suitable network hardware, but have
tested these changes with a dummy driver that can support various
combinations of operations and sizes, together with (a) Debian's ethtool
3.13 (b) ethtool 3.14 with the submitted patch to use ETHTOOL_{G,S}RSSH
and minor adjustment for fixes 1 and 3.
v2: Update RSS operations in vmxnet3 too
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
ETHTOOL_{G,S}RXFHINDIR and ETHTOOL_{G,S}RSSH should work for drivers
regardless of whether they expose the hash key, unless you try to
set a hash key for a driver that doesn't expose it.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
__sk_prepare_filter() was reworked in commit bd4cf0ed3 (net: filter:
rework/optimize internal BPF interpreter's instruction set) so that it should
have uncharged memory once things went wrong. However that work isn't complete.
Error is handled only in __sk_migrate_filter() while memory can still leak in
the error path right after sk_chk_filter().
Fixes: bd4cf0ed33 ("net: filter: rework/optimize internal BPF interpreter's instruction set")
Signed-off-by: Leon Yu <chianglungyu@gmail.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Tested-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ideally, we would need to generate IP ID using a per destination IP
generator.
linux kernels used inet_peer cache for this purpose, but this had a huge
cost on servers disabling MTU discovery.
1) each inet_peer struct consumes 192 bytes
2) inetpeer cache uses a binary tree of inet_peer structs,
with a nominal size of ~66000 elements under load.
3) lookups in this tree are hitting a lot of cache lines, as tree depth
is about 20.
4) If server deals with many tcp flows, we have a high probability of
not finding the inet_peer, allocating a fresh one, inserting it in
the tree with same initial ip_id_count, (cf secure_ip_id())
5) We garbage collect inet_peer aggressively.
IP ID generation do not have to be 'perfect'
Goal is trying to avoid duplicates in a short period of time,
so that reassembly units have a chance to complete reassembly of
fragments belonging to one message before receiving other fragments
with a recycled ID.
We simply use an array of generators, and a Jenkin hash using the dst IP
as a key.
ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
belongs (it is only used from this file)
secure_ip_id() and secure_ipv6_id() no longer are needed.
Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
unnecessary decrement/increment of the number of segments.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change provides a function to be used in order to break the
ndo_set_rx_mode call into a set of address add and remove calls. The code
is based on the implementation of dev_uc_sync/dev_mc_sync. Since they
essentially do the same thing but with only one dev I simply named my
functions __dev_uc_sync/__dev_mc_sync.
I also implemented an unsync version of the functions as well to allow for
cleanup on close.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 9739eef13c ("net: filter: make BPF conversion more readable")
started to introduce helper macros similar to BPF_STMT()/BPF_JUMP()
macros from classic BPF.
However, quite some statements in the filter conversion functions
remained in the old style which gives a mixture of block macros and
non block macros in the code. This patch makes the block macros itself
more readable by using explicit member initialization, and converts
the remaining ones where possible to remain in a more consistent state.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch finally allows us to get rid of the BPF_S_* enum.
Currently, the code performs unnecessary encode and decode
workarounds in seccomp and filter migration itself when a filter
is being attached in order to overcome BPF_S_* encoding which
is not used anymore by the new interpreter resp. JIT compilers.
Keeping it around would mean that also in future we would need
to extend and maintain this enum and related encoders/decoders.
We can get rid of all that and save us these operations during
filter attaching. Naturally, also JIT compilers need to be updated
by this.
Before JIT conversion is being done, each compiler checks if A
is being loaded at startup to obtain information if it needs to
emit instructions to clear A first. Since BPF extensions are a
subset of BPF_LD | BPF_{W,H,B} | BPF_ABS variants, case statements
for extensions can be removed at that point. To ease and minimalize
code changes in the classic JITs, we have introduced bpf_anc_helper().
Tested with test_bpf on x86_64 (JIT, int), s390x (JIT, int),
arm (JIT, int), i368 (int), ppc64 (JIT, int); for sparc we
unfortunately didn't have access, but changes are analogous to
the rest.
Joint work with Alexei Starovoitov.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mircea Gherzan <mgherzan@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Acked-by: Chema Gonzalez <chemag@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After 1e785f48d2 ("net: Start with correct mac_len in
skb_network_protocol") skb->mac_len is used as a start of the
calculation in skb_network_protocol() but that is not always correct. If
skb->protocol == 8021Q/AD, usually the vlan header is already inserted
in the skb (i.e. vlan reorder hdr == 0). Usually when the packet enters
dev_hard_xmit it has mac_len == 0 so we take 2 bytes from the
destination mac address (skb->data + VLAN_HLEN) as a type in
skb_network_protocol() and return vlan_depth == 4. In the case where TSO is
off, then the mac_len is set but it's == 18 (ETH_HLEN + VLAN_HLEN), so
skb_network_protocol() returns a type from inside the packet and
offset == 22. Also make vlan_depth unsigned as suggested before.
As suggested by Eric Dumazet, move the while() loop in the if() so we
can avoid additional testing in fast path.
Here are few netperf tests + debug printk's to illustrate:
cat netperf.tso-on.reorder-on.bugged
- Vlan -> device (reorder on, default, this case is okay)
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
192.168.3.1 () port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 10.00 7111.54
[ 81.605435] skb->len 65226 skb->gso_size 1448 skb->proto 0x800
skb->mac_len 0 vlan_depth 0 type 0x800
- Vlan -> device (reorder off, bad)
cat netperf.tso-on.reorder-off.bugged
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
192.168.3.1 () port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 10.00 241.35
[ 204.578332] skb->len 1518 skb->gso_size 0 skb->proto 0x8100
skb->mac_len 0 vlan_depth 4 type 0x5301
0x5301 are the last two bytes of the destination mac.
And if we stop TSO, we may get even the following:
[ 83.343156] skb->len 2966 skb->gso_size 1448 skb->proto 0x8100
skb->mac_len 18 vlan_depth 22 type 0xb84
Because mac_len already accounts for VLAN_HLEN.
After the fix:
cat netperf.tso-on.reorder-off.fixed
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
192.168.3.1 () port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 10.01 5001.46
[ 81.888489] skb->len 65230 skb->gso_size 1448 skb->proto 0x8100
skb->mac_len 0 vlan_depth 18 type 0x800
CC: Vlad Yasevich <vyasevic@redhat.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Daniel Borkman <dborkman@redhat.com>
CC: David S. Miller <davem@davemloft.net>
Fixes:1e785f48d29a ("net: Start with correct mac_len in
skb_network_protocol")
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/bonding/bond_alb.c
drivers/net/ethernet/altera/altera_msgdma.c
drivers/net/ethernet/altera/altera_sgdma.c
net/ipv6/xfrm6_output.c
Several cases of overlapping changes.
The xfrm6_output.c has a bug fix which overlaps the renaming
of skb->local_df to skb->ignore_df.
In the Altera TSE driver cases, the register access cleanups
in net-next overlapped with bug fixes done in net.
Similarly a bug fix to send ALB packets in the bonding driver using
the right source address overlaps with cleanups in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
The sk_unattached_filter_create() API is used by BPF filters that
are not directly attached or related to sockets, and are used in
team, ptp, xt_bpf, cls_bpf, etc. As such all users do their own
internal managment of obtaining filter blocks and thus already
have them in kernel memory and set up before calling into
sk_unattached_filter_create(). As a result, due to __user annotation
in sock_fprog, sparse triggers false positives (incorrect type in
assignment [different address space]) when filters are set up before
passing them to sk_unattached_filter_create(). Therefore, let
sk_unattached_filter_create() API use sock_fprog_kern to overcome
this issue.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lets get rid of this macro. After commit 5bcfedf06f ("net: filter:
simplify label names from jump-table"), labels have become more
readable due to omission of BPF_ prefix but at the same time more
generic, so that things like `git grep -n` would not find them. As
a middle path, lets get rid of the DL macro as it's not strictly
needed and would otherwise just hide the full name.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define separate fields in the sock structure for configuring disabling
checksums in both TX and RX-- sk_no_check_tx and sk_no_check_rx.
The SO_NO_CHECK socket option only affects sk_no_check_tx. Also,
removed UDP_CSUM_* defines since they are no longer necessary.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
o min_tx_rate puts lower limit on the VF bandwidth. VF is guaranteed
to have a bandwidth of at least this value.
max_tx_rate puts cap on the VF bandwidth. VF can have a bandwidth
of up to this value.
o A new handler set_vf_rate for attr IFLA_VF_RATE has been introduced
which takes 4 arguments:
netdev, VF number, min_tx_rate, max_tx_rate
o ndo_set_vf_rate replaces ndo_set_vf_tx_rate handler.
o Drivers that currently implement ndo_set_vf_tx_rate should now call
ndo_set_vf_rate instead and reject attempt to set a minimum bandwidth
greater than 0 for IFLA_VF_TX_RATE when IFLA_VF_RATE is not yet
implemented by driver.
o If user enters only one of either min_tx_rate or max_tx_rate, then,
userland should read back the other value from driver and set both
for IFLA_VF_RATE.
Drivers that have not yet implemented IFLA_VF_RATE should always
return min_tx_rate as 0 when read from ip tool.
o If both IFLA_VF_TX_RATE and IFLA_VF_RATE options are specified, then
IFLA_VF_RATE should override.
o Idea is to have consistent display of rate values to user.
o Usage example: -
./ip link set p4p1 vf 0 rate 900
./ip link show p4p1
32: p4p1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
DEFAULT qlen 1000
link/ether 00:0e:1e:08:b0:f0 brd ff:ff:ff:ff:ff:ff
vf 0 MAC 3e:a0:ca:bd:ae:5a, tx rate 900 (Mbps), max_tx_rate 900Mbps
vf 1 MAC f6:c6:7c:3f:3d:6c
vf 2 MAC 56:32:43:98:d7:71
vf 3 MAC d6:be:c3:b5:85:ff
vf 4 MAC ee:a9:9a:1e:19:14
vf 5 MAC 4a:d0:4c:07:52:18
vf 6 MAC 3a:76:44:93:62:f9
vf 7 MAC 82:e9:e7:e3:15:1a
./ip link set p4p1 vf 0 max_tx_rate 300 min_tx_rate 200
./ip link show p4p1
32: p4p1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
DEFAULT qlen 1000
link/ether 00:0e:1e:08:b0:f0 brd ff:ff:ff:ff:ff:ff
vf 0 MAC 3e:a0:ca:bd:ae:5a, tx rate 300 (Mbps), max_tx_rate 300Mbps,
min_tx_rate 200Mbps
vf 1 MAC f6:c6:7c:3f:3d:6c
vf 2 MAC 56:32:43:98:d7:71
vf 3 MAC d6:be:c3:b5:85:ff
vf 4 MAC ee:a9:9a:1e:19:14
vf 5 MAC 4a:d0:4c:07:52:18
vf 6 MAC 3a:76:44:93:62:f9
vf 7 MAC 82:e9:e7:e3:15:1a
./ip link set p4p1 vf 0 max_tx_rate 600 rate 300
./ip link show p4p1
32: p4p1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
DEFAULT qlen 1000
link/ether 00:0e:1e:08:b0:f brd ff:ff:ff:ff:ff:ff
vf 0 MAC 3e:a0:ca:bd:ae:5, tx rate 600 (Mbps), max_tx_rate 600Mbps,
min_tx_rate 200Mbps
vf 1 MAC f6:c6:7c:3f:3d:6c
vf 2 MAC 56:32:43:98:d7:71
vf 3 MAC d6:be:c3:b5:85:ff
vf 4 MAC ee:a9:9a:1e:19:14
vf 5 MAC 4a:d0:4c:07:52:18
vf 6 MAC 3a:76:44:93:62:f9
vf 7 MAC 82:e9:e7:e3:15:1a
Signed-off-by: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Although the implementation probably needs a lot of work, this initial API
allows to implement software TSO in mvneta and mv643xx_eth drivers in a not
so intrusive way.
Signed-off-by: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kernel API for classic BPF socket filters is:
sk_unattached_filter_create() - validate classic BPF, convert, JIT
SK_RUN_FILTER() - run it
sk_unattached_filter_destroy() - destroy socket filter
Cleanup internal BPF kernel API as following:
sk_filter_select_runtime() - final step of internal BPF creation.
Try to JIT internal BPF program, if JIT is not available select interpreter
SK_RUN_FILTER() - run it
sk_filter_free() - free internal BPF program
Disallow direct calls to BPF interpreter. Execution of the BPF program should
be done with SK_RUN_FILTER() macro.
Example of internal BPF create, run, destroy:
struct sk_filter *fp;
fp = kzalloc(sk_filter_size(prog_len), GFP_KERNEL);
memcpy(fp->insni, prog, prog_len * sizeof(fp->insni[0]));
fp->len = prog_len;
sk_filter_select_runtime(fp);
SK_RUN_FILTER(fp, ctx);
sk_filter_free(fp);
Sockets, seccomp, testsuite, tracing are using different ways to populate
sk_filter, so first steps of program creation are not common.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This would be a no-op, so there is no reason to request it.
This also allows conversion of the current implementations of
ethtool_ops::{get,set}_rxfh_indir to ethtool_ops::{get,set}_rxfh
with no change other than their parameters.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
We usually allocate special values of u32 fields starting from the top
down, so also change the value to 0xffffffff. As these operations
haven't been included in a stable release yet, it's not too late to
change.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
We must return -EFAULT immediately rather than continuing into
the loop.
Similarly, we may as well return -EINVAL directly.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Prior to commit fbd929f2dc
bonding: support QinQ for bond arp interval
the arp monitoring code allowed for proper detection of devices
stacked on top of vlans. Since the above commit, the
code can still detect a device stacked on top of single
vlan, but not a device stacked on top of Q-in-Q configuration.
The search will only set the inner vlan tag if the route
device is the vlan device. However, this is not always the
case, as it is possible to extend the stacked configuration.
With this patch it is possible to provision devices on
top Q-in-Q vlan configuration that should be used as
a source of ARP monitoring information.
For example:
ip link add link bond0 vlan10 type vlan proto 802.1q id 10
ip link add link vlan10 vlan100 type vlan proto 802.1q id 100
ip link add link vlan100 type macvlan
Note: This patch limites the number of stacked VLANs to 2,
just like before. The original, however had another issue
in that if we had more then 2 levels of VLANs, we would end
up generating incorrectly tagged traffic. This is no longer
possible.
Fixes: fbd929f2dc (bonding: support QinQ for bond arp interval)
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@redhat.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: Ding Tianhong <dingtianhong@huawei.com>
CC: Patric McHardy <kaber@trash.net>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit dc8eaaa006.
vlan: Fix lockdep warning when vlan dev handle notification
Instead we use the new new API to find the lock subclass of
our vlan device. This way we can support configurations where
vlans are interspersed with other devices:
bond -> vlan -> macvlan -> vlan
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Multiple devices in the kernel can be stacked/nested and they
need to know their nesting level for the purposes of lockdep.
This patch provides a generic function that determines a nesting
level of a particular device by its type (ex: vlan, macvlan, etc).
We only care about nesting of the same type of devices.
For example:
eth0 <- vlan0.10 <- macvlan0 <- vlan1.20
The nesting level of vlan1.20 would be 1, since there is another vlan
in the stack under it.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Starting from linux-3.13, GRO attempts to build full size skbs.
Problem is the commit assumed one particular field in skb->cb[]
was clean, but it is not the case on some stacked devices.
Timo reported a crash in case traffic is decrypted before
reaching a GRE device.
Fix this by initializing NAPI_GRO_CB(skb)->last at the right place,
this also removes one conditional.
Thanks a lot to Timo for providing full reports and bisecting this.
Fixes: 8a29111c7c ("net: gro: allow to build full sized skb")
Bisected-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
cgroup in general is moving towards using cgroup_subsys_state as the
fundamental structural component and css_parent() was introduced to
convert from using cgroup->parent to css->parent. It was quite some
time ago and we're moving forward with making css more prominent.
This patch drops the trivial wrapper css_parent() and let the users
dereference css->parent. While at it, explicitly mark fields of css
which are public and immutable.
v2: New usage from device_cgroup.c converted.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: "David S. Miller" <davem@davemloft.net>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Maps all internal BPF instructions into x86_64 instructions.
This patch replaces original BPF x64 JIT with internal BPF x64 JIT.
sysctl net.core.bpf_jit_enable is reused as on/off switch.
Performance:
1. old BPF JIT and internal BPF JIT generate equivalent x86_64 code.
No performance difference is observed for filters that were JIT-able before
Example assembler code for BPF filter "tcpdump port 22"
original BPF -> old JIT: original BPF -> internal BPF -> new JIT:
0: push %rbp 0: push %rbp
1: mov %rsp,%rbp 1: mov %rsp,%rbp
4: sub $0x60,%rsp 4: sub $0x228,%rsp
8: mov %rbx,-0x8(%rbp) b: mov %rbx,-0x228(%rbp) // prologue
12: mov %r13,-0x220(%rbp)
19: mov %r14,-0x218(%rbp)
20: mov %r15,-0x210(%rbp)
27: xor %eax,%eax // clear A
c: xor %ebx,%ebx 29: xor %r13,%r13 // clear X
e: mov 0x68(%rdi),%r9d 2c: mov 0x68(%rdi),%r9d
12: sub 0x6c(%rdi),%r9d 30: sub 0x6c(%rdi),%r9d
16: mov 0xd8(%rdi),%r8 34: mov 0xd8(%rdi),%r10
3b: mov %rdi,%rbx
1d: mov $0xc,%esi 3e: mov $0xc,%esi
22: callq 0xffffffffe1021e15 43: callq 0xffffffffe102bd75
27: cmp $0x86dd,%eax 48: cmp $0x86dd,%rax
2c: jne 0x0000000000000069 4f: jne 0x000000000000009a
2e: mov $0x14,%esi 51: mov $0x14,%esi
33: callq 0xffffffffe1021e31 56: callq 0xffffffffe102bd91
38: cmp $0x84,%eax 5b: cmp $0x84,%rax
3d: je 0x0000000000000049 62: je 0x0000000000000074
3f: cmp $0x6,%eax 64: cmp $0x6,%rax
42: je 0x0000000000000049 68: je 0x0000000000000074
44: cmp $0x11,%eax 6a: cmp $0x11,%rax
47: jne 0x00000000000000c6 6e: jne 0x0000000000000117
49: mov $0x36,%esi 74: mov $0x36,%esi
4e: callq 0xffffffffe1021e15 79: callq 0xffffffffe102bd75
53: cmp $0x16,%eax 7e: cmp $0x16,%rax
56: je 0x00000000000000bf 82: je 0x0000000000000110
58: mov $0x38,%esi 88: mov $0x38,%esi
5d: callq 0xffffffffe1021e15 8d: callq 0xffffffffe102bd75
62: cmp $0x16,%eax 92: cmp $0x16,%rax
65: je 0x00000000000000bf 96: je 0x0000000000000110
67: jmp 0x00000000000000c6 98: jmp 0x0000000000000117
69: cmp $0x800,%eax 9a: cmp $0x800,%rax
6e: jne 0x00000000000000c6 a1: jne 0x0000000000000117
70: mov $0x17,%esi a3: mov $0x17,%esi
75: callq 0xffffffffe1021e31 a8: callq 0xffffffffe102bd91
7a: cmp $0x84,%eax ad: cmp $0x84,%rax
7f: je 0x000000000000008b b4: je 0x00000000000000c2
81: cmp $0x6,%eax b6: cmp $0x6,%rax
84: je 0x000000000000008b ba: je 0x00000000000000c2
86: cmp $0x11,%eax bc: cmp $0x11,%rax
89: jne 0x00000000000000c6 c0: jne 0x0000000000000117
8b: mov $0x14,%esi c2: mov $0x14,%esi
90: callq 0xffffffffe1021e15 c7: callq 0xffffffffe102bd75
95: test $0x1fff,%ax cc: test $0x1fff,%rax
99: jne 0x00000000000000c6 d3: jne 0x0000000000000117
d5: mov %rax,%r14
9b: mov $0xe,%esi d8: mov $0xe,%esi
a0: callq 0xffffffffe1021e44 dd: callq 0xffffffffe102bd91 // MSH
e2: and $0xf,%eax
e5: shl $0x2,%eax
e8: mov %rax,%r13
eb: mov %r14,%rax
ee: mov %r13,%rsi
a5: lea 0xe(%rbx),%esi f1: add $0xe,%esi
a8: callq 0xffffffffe1021e0d f4: callq 0xffffffffe102bd6d
ad: cmp $0x16,%eax f9: cmp $0x16,%rax
b0: je 0x00000000000000bf fd: je 0x0000000000000110
ff: mov %r13,%rsi
b2: lea 0x10(%rbx),%esi 102: add $0x10,%esi
b5: callq 0xffffffffe1021e0d 105: callq 0xffffffffe102bd6d
ba: cmp $0x16,%eax 10a: cmp $0x16,%rax
bd: jne 0x00000000000000c6 10e: jne 0x0000000000000117
bf: mov $0xffff,%eax 110: mov $0xffff,%eax
c4: jmp 0x00000000000000c8 115: jmp 0x000000000000011c
c6: xor %eax,%eax 117: mov $0x0,%eax
c8: mov -0x8(%rbp),%rbx 11c: mov -0x228(%rbp),%rbx // epilogue
cc: leaveq 123: mov -0x220(%rbp),%r13
cd: retq 12a: mov -0x218(%rbp),%r14
131: mov -0x210(%rbp),%r15
138: leaveq
139: retq
On fully cached SKBs both JITed functions take 12 nsec to execute.
BPF interpreter executes the program in 30 nsec.
The difference in generated assembler is due to the following:
Old BPF imlements LDX_MSH instruction via sk_load_byte_msh() helper function
inside bpf_jit.S.
New JIT removes the helper and does it explicitly, so ldx_msh cost
is the same for both JITs, but generated code looks longer.
New JIT has 4 registers to save, so prologue/epilogue are larger,
but the cost is within noise on x64.
Old JIT checks whether first insn clears A and if not emits 'xor %eax,%eax'.
New JIT clears %rax unconditionally.
2. old BPF JIT doesn't support ANC_NLATTR, ANC_PAY_OFFSET, ANC_RANDOM
extensions. New JIT supports all BPF extensions.
Performance of such filters improves 2-4 times depending on a filter.
The longer the filter the higher performance gain.
Synthetic benchmarks with many ancillary loads see 20x speedup
which seems to be the maximum gain from JIT
Notes:
. net.core.bpf_jit_enable=2 + tools/net/bpf_jit_disasm is still functional
and can be used to see generated assembler
. there are two jit_compile() functions and code flow for classic filters is:
sk_attach_filter() - load classic BPF
bpf_jit_compile() - try to JIT from classic BPF
sk_convert_filter() - convert classic to internal
bpf_int_jit_compile() - JIT from internal BPF
seccomp and tracing filters will just call bpf_int_jit_compile()
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
From: Cong Wang <cwang@twopensource.com>
commit 50624c934d (net: Delay default_device_exit_batch until no
devices are unregistering) introduced rtnl_lock_unregistering() for
default_device_exit_batch(). Same race could happen we when rmmod a driver
which calls rtnl_link_unregister() as we call dev->destructor without rtnl
lock.
For long term, I think we should clean up the mess of netdev_run_todo()
and net namespce exit code.
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net_get_random_once depends on the static keys infrastructure to patch up
the branch to the slow path during boot. This was realized by abusing the
static keys api and defining a new initializer to not enable the call
site while still indicating that the branch point should get patched
up. This was needed to have the fast path considered likely by gcc.
The static key initialization during boot up normally walks through all
the registered keys and either patches in ideal nops or enables the jump
site but omitted that step on x86 if ideal nops where already placed at
static_key branch points. Thus net_get_random_once branches not always
became active.
This patch switches net_get_random_once to the ordinary static_key
api and thus places the kernel fast path in the - by gcc considered -
unlikely path. Microbenchmarks on Intel and AMD x86-64 showed that
the unlikely path actually beats the likely path in terms of cycle cost
and that different nop patterns did not make much difference, thus this
switch should not be noticeable.
Fixes: a48e42920f ("net: introduce new macro net_get_random_once")
Reported-by: Tuomas Räsänen <tuomasjjrasanen@tjjr.fi>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_unattached_filter_create() will copy the filter's instructions so we
don't need to have the master copy hanging around after initialization.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do not collide with the x86-64 PTRACE user API namespace.
net/core/filter.c:57:0: warning: "R8" redefined [enabled by default]
arch/x86/include/uapi/asm/ptrace-abi.h:38:0: note: this is the location of the previous definition
Fix by adding a BPF_ prefix to the register macros.
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 7e98056964("ipv6: router reachability probing"), a router falls
into NUD_FAILED will be probed.
Now if function rt6_select() selects a router which neighbour state is NUD_FAILED,
and at the same time function rt6_probe() changes the neighbour state to NUD_PROBE,
then function dst_neigh_output() can directly send packets, but actually the
neighbour still is unreachable. If we set nud_state to NUD_INCOMPLETE instead
NUD_PROBE, packets will not be sent out until the neihbour is reachable.
In addition, because the route should be probes with a single NS, so we must
set neigh->probes to neigh_max_probes(), then the neigh timer timeout and function
neigh_timer_handler() will not send other NS Messages.
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert all cftype->write_string() users to the new cftype->write()
which maps directly to kernfs write operation and has full access to
kernfs and cgroup contexts. The conversions are mostly mechanical.
* @css and @cft are accessed using of_css() and of_cft() accessors
respectively instead of being specified as arguments.
* Should return @nbytes on success instead of 0.
* @buf is not trimmed automatically. Trim if necessary. Note that
blkcg and netprio don't need this as the parsers already handle
whitespaces.
cftype->write_string() has no user left after the conversions and
removed.
While at it, remove unnecessary local variable @p in
cgroup_subtree_control_write() and stale comment about
CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.
This patch doesn't introduce any visible behavior changes.
v2: netprio was missing from conversion. Converted.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Aristeu Rozanski <arozansk@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
As suggested by several people, rename local_df to ignore_df,
since it means "ignore df bit if it is set".
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/altera/altera_sgdma.c
net/netlink/af_netlink.c
net/sched/cls_api.c
net/sched/sch_api.c
The netlink conflict dealt with moving to netlink_capable() and
netlink_ns_capable() in the 'net' tree vs. supporting 'tc' operations
in non-init namespaces. These were simple transformations from
netlink_capable to netlink_ns_capable.
The Altera driver conflict was simply code removal overlapping some
void pointer cast cleanups in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce BPF helper macros to define instructions
(similar to old BPF_STMT/BPF_JUMP macros)
Use them while converting classic BPF to internal
and in BPF testsuite later.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit d206940319,
there are no more callers.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes ordering of rtnl notifications during unregister_netdevice
by moving RTM_DELLINK notification to until after ndo_uninit.
The problem was seen with unregistering bond netdevices.
bond ndo_uninit callback generates a few RTM_NEWLINK notifications for
NETDEV_CHANGEADDR and NETDEV_FEAT_CHANGE. This is seen mostly when the
bond is deleted with slaves still enslaved to the bond.
During unregister netdevice (rollback_registered_many to be specific)
bond ndo_uninit is called after RTM_DELLINK notification goes out.
This results in userspace seeing RTM_DELLINK followed by a couple of
RTM_NEWLINK's.
In userspace problem was seen with libnl. libnl cache deletes the bond
when it sees RTM_DELLINK and re-adds the bond with the following
RTM_NEWLINK. Resulting in a stale bond entry in libnl cache when the kernel
has already deleted the bond.
This patch has been tested for bond, bridges and vlan devices.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This contains only some minor misc cleanpus. We can spare us the
extra variable declaration in __skb_get_pay_offset(), the cast in
__get_random_u32() is rather unnecessary and in __sk_migrate_realloc()
we can remove the memcpy() and do a direct assignment of the structs.
Latter was suggested by Fengguang Wu found with coccinelle. Also,
remaining pointer casts of long should be unsigned long instead.
Suggested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current code is a bit hard to parse on which registers can be used,
how they are mapped and all play together. It makes much more sense to
define this a bit more clearly so that the code is a bit more intuitive.
This patch cleans this up, and makes naming a bit more consistent among
the code. This also allows for moving some of the defines into the header
file. Clearing of A and X registers in __sk_run_filter() do not get a
particular register name assigned as they have not an 'official' function,
but rather just result from the concrete initial mapping of old BPF
programs. Since for BPF helper functions for BPF_CALL we already use
small letters, so be consistent here as well. No functional changes.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch simplifies label naming for the BPF jump-table.
When we define labels via DL(), we just concatenate/textify
the combination of instruction opcode which consists of the
class, subclass, word size, target register and so on. Each
time we leave BPF_ prefix intact, so that e.g. the preprocessor
generates a label BPF_ALU_BPF_ADD_BPF_X for DL(BPF_ALU, BPF_ADD,
BPF_X) whereas a label name of ALU_ADD_X is much more easy
to grasp. Pure cleanup only.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit 3de0b59239 ("ethtool: Support for configurable RSS hash
key") introduced a new function ethtool_copy_validate_indir() with
full iteration of the loop to validate the ring indices, which could
be an overkill. To minimize the impact, we ought to exit the loop as
soon as the invalid index occurs for the very first time. The
remaining loop simply doesn't serve any more purpose.
Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Cc: Venkata Duvvuru <VenkatKumar.Duvvuru@Emulex.Com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do not initialize net_kill_list twice.
list_replace_init() already takes care of initializing net_kill_list.
We don't need to initialize it with LIST_HEAD() beforehand.
Signed-off-by: xiao jin <jin.xiao@intel.com>
Reviewed-by: David Cohen <david.a.cohen@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since 115c9b8192 (rtnetlink: Fix problem with
buffer allocation), RTM_NEWLINK messages only contain the IFLA_VFINFO_LIST
attribute if they were solicited by a GETLINK message containing an
IFLA_EXT_MASK attribute with the RTEXT_FILTER_VF flag.
That was done because some user programs broke when they received more data
than expected - because IFLA_VFINFO_LIST contains information for each VF
it can become large if there are many VFs.
However, the IFLA_VF_PORTS attribute, supplied for devices which implement
ndo_get_vf_port (currently the 'enic' driver only), has the same problem.
It supplies per-VF information and can therefore become large, but it is
not currently conditional on the IFLA_EXT_MASK value.
Worse, it interacts badly with the existing EXT_MASK handling. When
IFLA_EXT_MASK is not supplied, the buffer for netlink replies is fixed at
NLMSG_GOODSIZE. If the information for IFLA_VF_PORTS exceeds this, then
rtnl_fill_ifinfo() returns -EMSGSIZE on the first message in a packet.
netlink_dump() will misinterpret this as having finished the listing and
omit data for this interface and all subsequent ones. That can cause
getifaddrs(3) to enter an infinite loop.
This patch addresses the problem by only supplying IFLA_VF_PORTS when
IFLA_EXT_MASK is supplied with the RTEXT_FILTER_VF flag set.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without IFLA_EXT_MASK specified, the information reported for a single
interface in response to RTM_GETLINK is expected to fit within a netlink
packet of NLMSG_GOODSIZE.
If it doesn't, however, things will go badly wrong, When listing all
interfaces, netlink_dump() will incorrectly treat -EMSGSIZE on the first
message in a packet as the end of the listing and omit information for
that interface and all subsequent ones. This can cause getifaddrs(3) to
enter an infinite loop.
This patch won't fix the problem, but it will WARN_ON() making it easier to
track down what's going wrong.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is possible by passing a netlink socket to a more privileged
executable and then to fool that executable into writing to the socket
data that happens to be valid netlink message to do something that
privileged executable did not intend to do.
To keep this from happening replace bare capable and ns_capable calls
with netlink_capable, netlink_net_calls and netlink_ns_capable calls.
Which act the same as the previous calls except they verify that the
opener of the socket had the desired permissions as well.
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_net_capable - The common case, operations that are safe in a network namespace.
sk_capable - Operations that are not known to be safe in a network namespace
sk_ns_capable - The general case for special cases.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The permission check in sock_diag_put_filterinfo is wrong, and it is so removed
from it's sources it is not clear why it is wrong. Move the computation
into packet_diag_dump and pass a bool of the result into sock_diag_filterinfo.
This does not yet correct the capability check but instead simply moves it to make
it clear what is going on.
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/intel/igb/e1000_mac.c
net/core/filter.c
Both conflicts were simple overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
exisiting BPF verifier allows uninitialized access to registers,
'ret A' is considered to be a valid filter.
So initialize A and X to zero to prevent leaking kernel memory
In the future BPF verifier will be rejecting such filters
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added a new ancillary load (bpf call in eBPF parlance) that produces
a 32-bit random number. We are implementing it as an ancillary load
(instead of an ISA opcode) because (a) it is simpler, (b) allows easy
JITing, and (c) seems more in line with generic ISAs that do not have
"get a random number" as a instruction, but as an OS call.
The main use for this ancillary load is to perform random packet sampling.
Signed-off-by: Chema Gonzalez <chema@google.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This ethtool patch primarily copies the ioctl command data structures
from/to the User space and invokes the driver hook.
Signed-off-by: Venkat Duvvuru <VenkatKumar.Duvvuru@Emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The caller needs capabilities on the namespace being queried, not on
their own namespace. This is a security bug, although it likely has
only a minor impact.
Cc: stable@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the helper __dev_forward_skb which is identical to
dev_forward_skb except that it doesn't actually inject the skb into
the stack. This is useful where we wish to have finer control over
how the packet is injected, e.g., via netif_rx_ni or netif_receive_skb.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When I open the LOCKDEP config and run these steps:
modprobe 8021q
vconfig add eth2 20
vconfig add eth2.20 30
ifconfig eth2 xx.xx.xx.xx
then the Call Trace happened:
[32524.386288] =============================================
[32524.386293] [ INFO: possible recursive locking detected ]
[32524.386298] 3.14.0-rc2-0.7-default+ #35 Tainted: G O
[32524.386302] ---------------------------------------------
[32524.386306] ifconfig/3103 is trying to acquire lock:
[32524.386310] (&vlan_netdev_addr_lock_key/1){+.....}, at: [<ffffffff814275f4>] dev_mc_sync+0x64/0xb0
[32524.386326]
[32524.386326] but task is already holding lock:
[32524.386330] (&vlan_netdev_addr_lock_key/1){+.....}, at: [<ffffffff8141af83>] dev_set_rx_mode+0x23/0x40
[32524.386341]
[32524.386341] other info that might help us debug this:
[32524.386345] Possible unsafe locking scenario:
[32524.386345]
[32524.386350] CPU0
[32524.386352] ----
[32524.386354] lock(&vlan_netdev_addr_lock_key/1);
[32524.386359] lock(&vlan_netdev_addr_lock_key/1);
[32524.386364]
[32524.386364] *** DEADLOCK ***
[32524.386364]
[32524.386368] May be due to missing lock nesting notation
[32524.386368]
[32524.386373] 2 locks held by ifconfig/3103:
[32524.386376] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff81431d42>] rtnl_lock+0x12/0x20
[32524.386387] #1: (&vlan_netdev_addr_lock_key/1){+.....}, at: [<ffffffff8141af83>] dev_set_rx_mode+0x23/0x40
[32524.386398]
[32524.386398] stack backtrace:
[32524.386403] CPU: 1 PID: 3103 Comm: ifconfig Tainted: G O 3.14.0-rc2-0.7-default+ #35
[32524.386409] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[32524.386414] ffffffff81ffae40 ffff8800d9625ae8 ffffffff814f68a2 ffff8800d9625bc8
[32524.386421] ffffffff810a35fb ffff8800d8a8d9d0 00000000d9625b28 ffff8800d8a8e5d0
[32524.386428] 000003cc00000000 0000000000000002 ffff8800d8a8e5f8 0000000000000000
[32524.386435] Call Trace:
[32524.386441] [<ffffffff814f68a2>] dump_stack+0x6a/0x78
[32524.386448] [<ffffffff810a35fb>] __lock_acquire+0x7ab/0x1940
[32524.386454] [<ffffffff810a323a>] ? __lock_acquire+0x3ea/0x1940
[32524.386459] [<ffffffff810a4874>] lock_acquire+0xe4/0x110
[32524.386464] [<ffffffff814275f4>] ? dev_mc_sync+0x64/0xb0
[32524.386471] [<ffffffff814fc07a>] _raw_spin_lock_nested+0x2a/0x40
[32524.386476] [<ffffffff814275f4>] ? dev_mc_sync+0x64/0xb0
[32524.386481] [<ffffffff814275f4>] dev_mc_sync+0x64/0xb0
[32524.386489] [<ffffffffa0500cab>] vlan_dev_set_rx_mode+0x2b/0x50 [8021q]
[32524.386495] [<ffffffff8141addf>] __dev_set_rx_mode+0x5f/0xb0
[32524.386500] [<ffffffff8141af8b>] dev_set_rx_mode+0x2b/0x40
[32524.386506] [<ffffffff8141b3cf>] __dev_open+0xef/0x150
[32524.386511] [<ffffffff8141b177>] __dev_change_flags+0xa7/0x190
[32524.386516] [<ffffffff8141b292>] dev_change_flags+0x32/0x80
[32524.386524] [<ffffffff8149ca56>] devinet_ioctl+0x7d6/0x830
[32524.386532] [<ffffffff81437b0b>] ? dev_ioctl+0x34b/0x660
[32524.386540] [<ffffffff814a05b0>] inet_ioctl+0x80/0xa0
[32524.386550] [<ffffffff8140199d>] sock_do_ioctl+0x2d/0x60
[32524.386558] [<ffffffff81401a52>] sock_ioctl+0x82/0x2a0
[32524.386568] [<ffffffff811a7123>] do_vfs_ioctl+0x93/0x590
[32524.386578] [<ffffffff811b2705>] ? rcu_read_lock_held+0x45/0x50
[32524.386586] [<ffffffff811b39e5>] ? __fget_light+0x105/0x110
[32524.386594] [<ffffffff811a76b1>] SyS_ioctl+0x91/0xb0
[32524.386604] [<ffffffff815057e2>] system_call_fastpath+0x16/0x1b
========================================================================
The reason is that all of the addr_lock_key for vlan dev have the same class,
so if we change the status for vlan dev, the vlan dev and its real dev will
hold the same class of addr_lock_key together, so the warning happened.
we should distinguish the lock depth for vlan dev and its real dev.
v1->v2: Convert the vlan_netdev_addr_lock_key to an array of eight elements, which
could support to add 8 vlan id on a same vlan dev, I think it is enough for current
scene, because a netdev's name is limited to IFNAMSIZ which could not hold 8 vlan id,
and the vlan dev would not meet the same class key with its real dev.
The new function vlan_dev_get_lockdep_subkey() will return the subkey and make the vlan
dev could get a suitable class key.
v2->v3: According David's suggestion, I use the subclass to distinguish the lock key for vlan dev
and its real dev, but it make no sense, because the difference for subclass in the
lock_class_key doesn't mean that the difference class for lock_key, so I use lock_depth
to distinguish the different depth for every vlan dev, the same depth of the vlan dev
could have the same lock_class_key, I import the MAX_LOCK_DEPTH from the include/linux/sched.h,
I think it is enough here, the lockdep should never exceed that value.
v3->v4: Add a huge array of locking keys will waste static kernel memory and is not a appropriate method,
we could use _nested() variants to fix the problem, calculate the depth for every vlan dev,
and use the depth as the subclass for addr_lock_key.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the dst->output() path for ipv4, the code assumes the skb it has to
transmit is attached to an inet socket, specifically via
ip_mc_output() : The sk_mc_loop() test triggers a WARN_ON() when the
provider of the packet is an AF_PACKET socket.
The dst->output() method gets an additional 'struct sock *sk'
parameter. This needs a cascade of changes so that this parameter can
be propagated from vxlan to final consumer.
Fixes: 8f646c922d ("vxlan: keep original skb ownership")
Reported-by: lucien xin <lucien.xin@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sometimes, when the packet arrives at skb_mac_gso_segment()
its skb->mac_len already accounts for some of the mac lenght
headers in the packet. This seems to happen when forwarding
through and OpenSSL tunnel.
When we start looking for any vlan headers in skb_network_protocol()
we seem to ignore any of the already known mac headers and start
with an ETH_HLEN. This results in an incorrect offset, dropped
TSO frames and general slowness of the connection.
We can start counting from the known skb->mac_len
and return at least that much if all mac level headers
are known and accounted for.
Fixes: 53d6471cef (net: Account for all vlan headers in skb_mac_gso_segment)
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Daniel Borkman <dborkman@redhat.com>
Tested-by: Martin Filip <nexus+kernel@smoula.net>
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reviewing seccomp code, we found that BPF_S_ANC_SECCOMP_LD_W has
been wrongly decoded by commit a8fc927780 ("sk-filter: Add ability to
get socket filter program (v2)") into the opcode BPF_LD|BPF_B|BPF_ABS
although it should have been decoded as BPF_LD|BPF_W|BPF_ABS.
In practice, this should not have much side-effect though, as such
conversion is/was being done through prctl(2) PR_SET_SECCOMP. Reverse
operation PR_GET_SECCOMP will only return the current seccomp mode, but
not the filter itself. Since the transition to the new BPF infrastructure,
it's also not used anymore, so we can simply remove this as it's
unreachable.
Fixes: a8fc927780 ("sk-filter: Add ability to get socket filter program (v2)")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The BPF_S_ANC_NLATTR and BPF_S_ANC_NLATTR_NEST extensions fail to check
for a minimal message length before testing the supplied offset to be
within the bounds of the message. This allows the subtraction of the nla
header to underflow and therefore -- as the data type is unsigned --
allowing far to big offset and length values for the search of the
netlink attribute.
The remainder calculation for the BPF_S_ANC_NLATTR_NEST extension is
also wrong. It has the minuend and subtrahend mixed up, therefore
calculates a huge length value, allowing to overrun the end of the
message while looking for the netlink attribute.
The following three BPF snippets will trigger the bugs when attached to
a UNIX datagram socket and parsing a message with length 1, 2 or 3.
,-[ PoC for missing size check in BPF_S_ANC_NLATTR ]--
| ld #0x87654321
| ldx #42
| ld #nla
| ret a
`---
,-[ PoC for the same bug in BPF_S_ANC_NLATTR_NEST ]--
| ld #0x87654321
| ldx #42
| ld #nlan
| ret a
`---
,-[ PoC for wrong remainder calculation in BPF_S_ANC_NLATTR_NEST ]--
| ; (needs a fake netlink header at offset 0)
| ld #0
| ldx #42
| ld #nlan
| ret a
`---
Fix the first issue by ensuring the message length fulfills the minimal
size constrains of a nla header. Fix the second bug by getting the math
for the remainder calculation right.
Fixes: 4738c1db15 ("[SKFILTER]: Add SKF_ADF_NLATTR instruction")
Fixes: d214c7537b ("filter: add SKF_AD_NLATTR_NEST to look for nested..")
Cc: Patrick McHardy <kaber@trash.net>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similarly to commit 43279500de ("packet: respect devices with
LLTX flag in direct xmit"), we can basically apply the very same
to pktgen. This will help testing against LLTX devices such as
dummy driver (or others), which only have a single netdevice txq
and would otherwise require locking their txq from pktgen side
while e.g. in dummy case, we would not need any locking. Fix this
by making use of HARD_TX_{UN,}LOCK API, so that NETIF_F_LLTX will
be respected.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several spots in the kernel perform a sequence like:
skb_queue_tail(&sk->s_receive_queue, skb);
sk->sk_data_ready(sk, skb->len);
But at the moment we place the SKB onto the socket receive queue it
can be consumed and freed up. So this skb->len access is potentially
to freed up memory.
Furthermore, the skb->len can be modified by the consumer so it is
possible that the value isn't accurate.
And finally, no actual implementation of this callback actually uses
the length argument. And since nobody actually cared about it's
value, lots of call sites pass arbitrary values in such as '0' and
even '1'.
So just remove the length argument from the callback, that way there
is no confusion whatsoever and all of these use-after-free cases get
fixed as a side effect.
Based upon a patch by Eric Dumazet and his suggestion to audit this
issue tree-wide.
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of tcp, gso_size contains the tcpmss.
For UFO (udp fragmentation offloading) skbs, gso_size is the fragment
payload size, i.e. we must not account for udp header size.
Otherwise, when using virtio drivers, a to-be-forwarded UFO GSO packet
will be needlessly fragmented in the forward path, because we think its
individual segments are too large for the outgoing link.
Fixes: fe6cc55f3a ("net: ip, ipv6: handle gso skbs in forwarding path")
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: Tobias Brunner <tobias@strongswan.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull more networking updates from David Miller:
1) If a VXLAN interface is created with no groups, we can crash on
reception of packets. Fix from Mike Rapoport.
2) Missing includes in CPTS driver, from Alexei Starovoitov.
3) Fix string validations in isdnloop driver, from YOSHIFUJI Hideaki
and Dan Carpenter.
4) Missing irq.h include in bnxw2x, enic, and qlcnic drivers. From
Josh Boyer.
5) AF_PACKET transmit doesn't statistically count TX drops, from Daniel
Borkmann.
6) Byte-Queue-Limit enabled drivers aren't handled properly in
AF_PACKET transmit path, also from Daniel Borkmann.
Same problem exists in pktgen, and Daniel fixed it there too.
7) Fix resource leaks in driver probe error paths of new sxgbe driver,
from Francois Romieu.
8) Truesize of SKBs can gradually get more and more corrupted in NAPI
packet recycling path, fix from Eric Dumazet.
9) Fix uniprocessor netfilter build, from Florian Westphal. In the
longer term we should perhaps try to find a way for ARRAY_SIZE() to
work even with zero sized array elements.
10) Fix crash in netfilter conntrack extensions due to mis-estimation of
required extension space. From Andrey Vagin.
11) Since we commit table rule updates before trying to copy the
counters back to userspace (it's the last action we perform), we
really can't signal the user copy with an error as we are beyond the
point from which we can unwind everything. This causes all kinds of
use after free crashes and other mysterious behavior.
From Thomas Graf.
12) Restore previous behvaior of div/mod by zero in BPF filter
processing. From Daniel Borkmann.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (38 commits)
net: sctp: wake up all assocs if sndbuf policy is per socket
isdnloop: several buffer overflows
netdev: remove potentially harmful checks
pktgen: fix xmit test for BQL enabled devices
net/at91_ether: avoid NULL pointer dereference
tipc: Let tipc_release() return 0
at86rf230: fix MAX_CSMA_RETRIES parameter
mac802154: fix duplicate #include headers
sxgbe: fix duplicate #include headers
net: filter: be more defensive on div/mod by X==0
netfilter: Can't fail and free after table replacement
xen-netback: Trivial format string fix
net: bcmgenet: Remove unnecessary version.h inclusion
net: smc911x: Remove unused local variable
bonding: Inactive slaves should keep inactive flag's value
netfilter: nf_tables: fix wrong format in request_module()
netfilter: nf_tables: set names cannot be larger than 15 bytes
netfilter: nf_conntrack: reserve two bytes for nf_ct_ext->len
netfilter: Add {ipt,ip6t}_osf aliases for xt_osf
netfilter: x_tables: allow to use cgroup match for LOCAL_IN nf hooks
...
The purpose of this single series of commits from Srivatsa S Bhat (with
a small piece from Gautham R Shenoy) touching multiple subsystems that use
CPU hotplug notifiers is to provide a way to register them that will not
lead to deadlocks with CPU online/offline operations as described in the
changelog of commit 93ae4f978c (CPU hotplug: Provide lockless versions
of callback registration functions).
The first three commits in the series introduce the API and document it
and the rest simply goes through the users of CPU hotplug notifiers and
converts them to using the new method.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJTQow2AAoJEILEb/54YlRxW4QQAJlYRDUzwFJzJzYhltQYuVR+
4D74XMtvXgoJfg3cwdSWvMKKpJZnA9BVN0f7Hcx9wYmgdexYUuHeZJmMNyc3S2+g
KjKBIsugvgmZhHbbLd6TJ6GBbhGT5JLt9VmSfL9zIkveInU1YHFUUqL/mxdHm4J0
BSGKjk2rN3waRJgmY+xfliFLtQjDKFwJpMuvrgtoUyfas3f4sIV43UNbqdvA/weJ
rzedxXOlKH/id4b56lj/4iIzcoL3mwvJJ7r6n0CEMsKv87z09kqR0O+69Tsq/cgs
j17CsvoJOmZGk3QTeKVMQWBsvk6aPoDu3zK83gLbQMt+qjOpSTbJLz/3HZw4/TrW
ss4nuZne1DLMGS+6hoxYbTP+6Ni//Kn+l/LrHc5jb7m1X3lMO4W2aV3IROtIE1rv
lEP1IG01NU4u9YwkVj1dyhrkSp8tLPul4SrUK8W+oNweOC5crjJV7vJbIPJgmYiM
IZN55wln0yVRtR4TX+rmvN0PixsInE8MeaVCmReApyF9pdzul/StxlBze5BKLSJD
cqo1kNPpsmdxoDucqUpQ/gSvy+IOl2qnlisB5PpV93sk7De6TFDYrGHxjYIW7jMf
StXwdCDDQhzd2Q8Kfpp895A1dbIl8rKtwA6bTU2eX+BfMVFzuMdT44cvosx1+UdQ
sWl//rg76nb13dFjvF+q
=SW7Q
-----END PGP SIGNATURE-----
Merge tag 'cpu-hotplug-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull CPU hotplug notifiers registration fixes from Rafael Wysocki:
"The purpose of this single series of commits from Srivatsa S Bhat
(with a small piece from Gautham R Shenoy) touching multiple
subsystems that use CPU hotplug notifiers is to provide a way to
register them that will not lead to deadlocks with CPU online/offline
operations as described in the changelog of commit 93ae4f978c ("CPU
hotplug: Provide lockless versions of callback registration
functions").
The first three commits in the series introduce the API and document
it and the rest simply goes through the users of CPU hotplug notifiers
and converts them to using the new method"
* tag 'cpu-hotplug-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (52 commits)
net/iucv/iucv.c: Fix CPU hotplug callback registration
net/core/flow.c: Fix CPU hotplug callback registration
mm, zswap: Fix CPU hotplug callback registration
mm, vmstat: Fix CPU hotplug callback registration
profile: Fix CPU hotplug callback registration
trace, ring-buffer: Fix CPU hotplug callback registration
xen, balloon: Fix CPU hotplug callback registration
hwmon, via-cputemp: Fix CPU hotplug callback registration
hwmon, coretemp: Fix CPU hotplug callback registration
thermal, x86-pkg-temp: Fix CPU hotplug callback registration
octeon, watchdog: Fix CPU hotplug callback registration
oprofile, nmi-timer: Fix CPU hotplug callback registration
intel-idle: Fix CPU hotplug callback registration
clocksource, dummy-timer: Fix CPU hotplug callback registration
drivers/base/topology.c: Fix CPU hotplug callback registration
acpi-cpufreq: Fix CPU hotplug callback registration
zsmalloc: Fix CPU hotplug callback registration
scsi, fcoe: Fix CPU hotplug callback registration
scsi, bnx2fc: Fix CPU hotplug callback registration
scsi, bnx2i: Fix CPU hotplug callback registration
...
Currently we're checking a variable for != NULL after actually
dereferencing it, in netdev_lower_get_next_private*().
It's counter-intuitive at best, and can lead to faulty usage (as it implies
that the variable can be NULL), so fix it by removing the useless checks.
Reported-by: Daniel Borkmann <dborkman@redhat.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: stephen hemminger <stephen@networkplumber.org>
CC: Jerry Chu <hkchu@google.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similarly as in commit 8e2f1a63f2 ("packet: fix packet_direct_xmit
for BQL enabled drivers"), we test for __QUEUE_STATE_STACK_XOFF bit
in pktgen's xmit, which would not fully fill the device's TX ring for
BQL drivers that use netdev_tx_sent_queue(). Fix is to use, similarly
as we do in packet sockets, netif_xmit_frozen_or_drv_stopped() test.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The old interpreter behaviour was that we returned with 0
whenever we found a division by 0 would take place. In the new
interpreter we would currently just skip that instead and
continue execution.
It's true that a value of 0 as return might not be appropriate
in all cases, but current users (socket filters -> drop
packet, seccomp -> SECCOMP_RET_KILL, cls_bpf -> unclassified,
etc) seem fine with that behaviour. Better this than undefined
BPF program behaviour as it's expected that A contains the
result of the division. In future, as more use cases open up,
we could further adapt this return value to our needs, if
necessary.
So reintroduce return of 0 for division by 0 as in the old
interpreter. Also in case of K which is guaranteed to be 32bit
wide, sk_chk_filter() already takes care of preventing division
by 0 invoked through K, so we can generally spare us these tests.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Reviewed-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recycling skb always had been very tough...
This time it appears GRO layer can accumulate skb->truesize
adjustments made by drivers when they attach a fragment to skb.
skb_gro_receive() can only subtract from skb->truesize the used part
of a fragment.
I spotted this problem seeing TcpExtPruneCalled and
TcpExtTCPRcvCollapsed that were unexpected with a recent kernel, where
TCP receive window should be sized properly to accept traffic coming
from a driver not overshooting skb->truesize.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull cgroup updates from Tejun Heo:
"A lot updates for cgroup:
- The biggest one is cgroup's conversion to kernfs. cgroup took
after the long abandoned vfs-entangled sysfs implementation and
made it even more convoluted over time. cgroup's internal objects
were fused with vfs objects which also brought in vfs locking and
object lifetime rules. Naturally, there are places where vfs rules
don't fit and nasty hacks, such as credential switching or lock
dance interleaving inode mutex and cgroup_mutex with object serial
number comparison thrown in to decide whether the operation is
actually necessary, needed to be employed.
After conversion to kernfs, internal object lifetime and locking
rules are mostly isolated from vfs interactions allowing shedding
of several nasty hacks and overall simplification. This will also
allow implmentation of operations which may affect multiple cgroups
which weren't possible before as it would have required nesting
i_mutexes.
- Various simplifications including dropping of module support,
easier cgroup name/path handling, simplified cgroup file type
handling and task_cg_lists optimization.
- Prepatory changes for the planned unified hierarchy, which is still
a patchset away from being actually operational. The dummy
hierarchy is updated to serve as the default unified hierarchy.
Controllers which aren't claimed by other hierarchies are
associated with it, which BTW was what the dummy hierarchy was for
anyway.
- Various fixes from Li and others. This pull request includes some
patches to add missing slab.h to various subsystems. This was
triggered xattr.h include removal from cgroup.h. cgroup.h
indirectly got included a lot of files which brought in xattr.h
which brought in slab.h.
There are several merge commits - one to pull in kernfs updates
necessary for converting cgroup (already in upstream through
driver-core), others for interfering changes in the fixes branch"
* 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
cgroup: remove useless argument from cgroup_exit()
cgroup: fix spurious lockdep warning in cgroup_exit()
cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
cgroup: break kernfs active_ref protection in cgroup directory operations
cgroup: fix cgroup_taskset walking order
cgroup: implement CFTYPE_ONLY_ON_DFL
cgroup: make cgrp_dfl_root mountable
cgroup: drop const from @buffer of cftype->write_string()
cgroup: rename cgroup_dummy_root and related names
cgroup: move ->subsys_mask from cgroupfs_root to cgroup
cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
cgroup: reorganize cgroup bootstrapping
cgroup: relocate setting of CGRP_DEAD
cpuset: use rcu_read_lock() to protect task_cs()
cgroup_freezer: document freezer_fork() subtleties
cgroup: update cgroup_transfer_tasks() to either succeed or fail
cgroup: drop task_lock() protection around task->cgroups
cgroup: update how a newly forked task gets associated with css_set
...
Currently there is no way how to find out if a device supports busy
polling. So add a feature and make it dependent on ndo_busy_poll
existence.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
"Here is my initial pull request for the networking subsystem during
this merge window:
1) Support for ESN in AH (RFC 4302) from Fan Du.
2) Add full kernel doc for ethtool command structures, from Ben
Hutchings.
3) Add BCM7xxx PHY driver, from Florian Fainelli.
4) Export computed TCP rate information in netlink socket dumps, from
Eric Dumazet.
5) Allow IPSEC SA to be dumped partially using a filter, from Nicolas
Dichtel.
6) Convert many drivers to pci_enable_msix_range(), from Alexander
Gordeev.
7) Record SKB timestamps more efficiently, from Eric Dumazet.
8) Switch to microsecond resolution for TCP round trip times, also
from Eric Dumazet.
9) Clean up and fix 6lowpan fragmentation handling by making use of
the existing inet_frag api for it's implementation.
10) Add TX grant mapping to xen-netback driver, from Zoltan Kiss.
11) Auto size SKB lengths when composing netlink messages based upon
past message sizes used, from Eric Dumazet.
12) qdisc dumps can take a long time, add a cond_resched(), From Eric
Dumazet.
13) Sanitize netpoll core and drivers wrt. SKB handling semantics.
Get rid of never-used-in-tree netpoll RX handling. From Eric W
Biederman.
14) Support inter-address-family and namespace changing in VTI tunnel
driver(s). From Steffen Klassert.
15) Add Altera TSE driver, from Vince Bridgers.
16) Optimizing csum_replace2() so that it doesn't adjust the checksum
by checksumming the entire header, from Eric Dumazet.
17) Expand BPF internal implementation for faster interpreting, more
direct translations into JIT'd code, and much cleaner uses of BPF
filtering in non-socket ocntexts. From Daniel Borkmann and Alexei
Starovoitov"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1976 commits)
netpoll: Use skb_irq_freeable to make zap_completion_queue safe.
net: Add a test to see if a skb is freeable in irq context
qlcnic: Fix build failure due to undefined reference to `vxlan_get_rx_port'
net: ptp: move PTP classifier in its own file
net: sxgbe: make "core_ops" static
net: sxgbe: fix logical vs bitwise operation
net: sxgbe: sxgbe_mdio_register() frees the bus
Call efx_set_channels() before efx->type->dimension_resources()
xen-netback: disable rogue vif in kthread context
net/mlx4: Set proper build dependancy with vxlan
be2net: fix build dependency on VxLAN
mac802154: make csma/cca parameters per-wpan
mac802154: allow only one WPAN to be up at any given time
net: filter: minor: fix kdoc in __sk_run_filter
netlink: don't compare the nul-termination in nla_strcmp
can: c_can: Avoid led toggling for every packet.
can: c_can: Simplify TX interrupt cleanup
can: c_can: Store dlc private
can: c_can: Reduce register access
can: c_can: Make the code readable
...
Pull core block layer updates from Jens Axboe:
"This is the pull request for the core block IO bits for the 3.15
kernel. It's a smaller round this time, it contains:
- Various little blk-mq fixes and additions from Christoph and
myself.
- Cleanup of the IPI usage from the block layer, and associated
helper code. From Frederic Weisbecker and Jan Kara.
- Duplicate code cleanup in bio-integrity from Gu Zheng. This will
give you a merge conflict, but that should be easy to resolve.
- blk-mq notify spinlock fix for RT from Mike Galbraith.
- A blktrace partial accounting bug fix from Roman Pen.
- Missing REQ_SYNC detection fix for blk-mq from Shaohua Li"
* 'for-3.15/core' of git://git.kernel.dk/linux-block: (25 commits)
blk-mq: add REQ_SYNC early
rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock
blk-mq: support partial I/O completions
blk-mq: merge blk_mq_insert_request and blk_mq_run_request
blk-mq: remove blk_mq_alloc_rq
blk-mq: don't dump CPU -> hw queue map on driver load
blk-mq: fix wrong usage of hctx->state vs hctx->flags
blk-mq: allow blk_mq_init_commands() to return failure
block: remove old blk_iopoll_enabled variable
blktrace: fix accounting of partially completed requests
smp: Rename __smp_call_function_single() to smp_call_function_single_async()
smp: Remove wait argument from __smp_call_function_single()
watchdog: Simplify a little the IPI call
smp: Move __smp_call_function_single() below its safe version
smp: Consolidate the various smp_call_function_single() declensions
smp: Teach __smp_call_function_single() to check for offline cpus
smp: Remove unused list_head from csd
smp: Iterate functions through llist_for_each_entry_safe()
block: Stop abusing rq->csd.list in blk-softirq
block: Remove useless IPI struct initialization
...
Replace the test in zap_completion_queue to test when it is safe to
free skbs in hard irq context with skb_irq_freeable ensuring we only
free skbs when it is safe, and removing the possibility of subtle
problems.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit fixes a build error reported by Fengguang, that is
triggered when CONFIG_NETWORK_PHY_TIMESTAMPING is not set:
ERROR: "ptp_classify_raw" [drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.ko] undefined!
The fix is to introduce its own file for the PTP BPF classifier,
so that PTP_1588_CLOCK and/or NETWORK_PHY_TIMESTAMPING can select
it independently from each other. IXP4xx driver on ARM needs to
select it as well since it does not seem to select PTP_1588_CLOCK
or similar that would pull it in automatically.
This also allows for hiding all of the internals of the BPF PTP
program inside that file, and only exporting relevant API bits
to drivers.
This patch also adds a kdoc documentation of ptp_classify_raw()
API to make it clear that it can return PTP_CLASS_* defines. Also,
the BPF program has been translated into bpf_asm code, so that it
can be more easily read and altered (extensively documented in [1]).
In the kernel tree under tools/net/ we have bpf_asm and bpf_dbg
tools, so the commented program can simply be translated via
`./bpf_asm -c prog` where prog is a file that contains the
commented code. This makes it easily readable/verifiable and when
there's a need to change something, jump offsets etc do not need
to be replaced manually which can be very error prone. Instead,
a newly translated version via bpf_asm can simply replace the old
code. I have checked opcode diffs before/after and it's the very
same filter.
[1] Documentation/networking/filter.txt
Fixes: 164d8c6665 ("net: ptp: do not reimplement PTP/BPF classifier")
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Jiri Benc <jbenc@redhat.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This minor patch fixes the following warning when doing
a `make htmldocs`:
DOCPROC Documentation/DocBook/networking.xml
Warning(.../net/core/filter.c:135): No description found for parameter 'insn'
Warning(.../net/core/filter.c:135): Excess function parameter 'fentry' description in '__sk_run_filter'
HTML Documentation/DocBook/networking.html
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Main difference between napi_frags_skb() and napi_gro_receive() is that
the later is called while ethernet header was already pulled by the NIC
driver (eth_type_trans() was called before napi_gro_receive())
Jerry Chu in commit 299603e837 ("net-gro: Prepare GRO stack for the
upcoming tunneling support") tried to remove this difference by calling
eth_type_trans() from napi_frags_skb() instead of doing this later from
napi_frags_finish()
Goal was that napi_gro_complete() could call
ptype->callbacks.gro_complete(skb, 0) (offset of first network header =
0)
Also, xxx_gro_receive() handlers all use off = skb_gro_offset(skb) to
point to their own header, for the current skb and ones held in gro_list
Problem is this cleanup work defeated the frag0 optimization:
It turns out the consecutive pskb_may_pull() calls are too expensive.
This patch brings back the frag0 stuff in napi_frags_skb().
As all skb have their mac header in skb head, we no longer need
skb_gro_mac_header()
Reported-by: Michal Schmidt <mschmidt@redhat.com>
Fixes: 299603e837 ("net-gro: Prepare GRO stack for the upcoming tunneling support")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows to monitor carrier on/off transitions and detect link
flapping issues:
- new /sys/class/net/X/carrier_changes
- new rtnetlink IFLA_CARRIER_CHANGES (getlink)
Tested:
- grep . /sys/class/net/*/carrier_changes
+ ip link set dev X down/up
+ plug/unplug cable
- updated iproute2: prints IFLA_CARRIER_CHANGES
- iproute2 20121211-2 (debian): unchanged behavior
Signed-off-by: David Decotigny <decot@googlers.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces/reworks the kernel-internal BPF interpreter with
an optimized BPF instruction set format that is modelled closer to
mimic native instruction sets and is designed to be JITed with one to
one mapping. Thus, the new interpreter is noticeably faster than the
current implementation of sk_run_filter(); mainly for two reasons:
1. Fall-through jumps:
BPF jump instructions are forced to go either 'true' or 'false'
branch which causes branch-miss penalty. The new BPF jump
instructions have only one branch and fall-through otherwise,
which fits the CPU branch predictor logic better. `perf stat`
shows drastic difference for branch-misses between the old and
new code.
2. Jump-threaded implementation of interpreter vs switch
statement:
Instead of single table-jump at the top of 'switch' statement,
gcc will now generate multiple table-jump instructions, which
helps CPU branch predictor logic.
Note that the verification of filters is still being done through
sk_chk_filter() in classical BPF format, so filters from user- or
kernel space are verified in the same way as we do now, and same
restrictions/constraints hold as well.
We reuse current BPF JIT compilers in a way that this upgrade would
even be fine as is, but nevertheless allows for a successive upgrade
of BPF JIT compilers to the new format.
The internal instruction set migration is being done after the
probing for JIT compilation, so in case JIT compilers are able to
create a native opcode image, we're going to use that, and in all
other cases we're doing a follow-up migration of the BPF program's
instruction set, so that it can be transparently run in the new
interpreter.
In short, the *internal* format extends BPF in the following way (more
details can be taken from the appended documentation):
- Number of registers increase from 2 to 10
- Register width increases from 32-bit to 64-bit
- Conditional jt/jf targets replaced with jt/fall-through
- Adds signed > and >= insns
- 16 4-byte stack slots for register spill-fill replaced
with up to 512 bytes of multi-use stack space
- Introduction of bpf_call insn and register passing convention
for zero overhead calls from/to other kernel functions
- Adds arithmetic right shift and endianness conversion insns
- Adds atomic_add insn
- Old tax/txa insns are replaced with 'mov dst,src' insn
Performance of two BPF filters generated by libpcap resp. bpf_asm
was measured on x86_64, i386 and arm32 (other libpcap programs
have similar performance differences):
fprog #1 is taken from Documentation/networking/filter.txt:
tcpdump -i eth0 port 22 -dd
fprog #2 is taken from 'man tcpdump':
tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
((tcp[12]&0xf0)>>2)) != 0)' -dd
Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the
same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call,
smaller is better:
--x86_64--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 90 101 192 202
new BPF 31 71 47 97
old BPF jit 12 34 17 44
new BPF jit TBD
--i386--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 107 136 227 252
new BPF 40 119 69 172
--arm32--
fprog #1 fprog #1 fprog #2 fprog #2
cache-hit cache-miss cache-hit cache-miss
old BPF 202 300 475 540
new BPF 180 270 330 470
old BPF jit 26 182 37 202
new BPF jit TBD
Thus, without changing any userland BPF filters, applications on
top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf
classifier, netfilter's xt_bpf, team driver's load-balancing mode,
and many more will have better interpreter filtering performance.
While we are replacing the internal BPF interpreter, we also need
to convert seccomp BPF in the same step to make use of the new
internal structure since it makes use of lower-level API details
without being further decoupled through higher-level calls like
sk_unattached_filter_{create,destroy}(), for example.
Just as for normal socket filtering, also seccomp BPF experiences
a time-to-verdict speedup:
05-sim-long_jumps.c of libseccomp was used as micro-benchmark:
seccomp_rule_add_exact(ctx,...
seccomp_rule_add_exact(ctx,...
rc = seccomp_load(ctx);
for (i = 0; i < 10000000; i++)
syscall(199, 100);
'short filter' has 2 rules
'large filter' has 200 rules
'short filter' performance is slightly better on x86_64/i386/arm32
'large filter' is much faster on x86_64 and i386 and shows no
difference on arm32
--x86_64-- short filter
old BPF: 2.7 sec
39.12% bench libc-2.15.so [.] syscall
8.10% bench [kernel.kallsyms] [k] sk_run_filter
6.31% bench [kernel.kallsyms] [k] system_call
5.59% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
4.37% bench [kernel.kallsyms] [k] trace_hardirqs_off_caller
3.70% bench [kernel.kallsyms] [k] __secure_computing
3.67% bench [kernel.kallsyms] [k] lock_is_held
3.03% bench [kernel.kallsyms] [k] seccomp_bpf_load
new BPF: 2.58 sec
42.05% bench libc-2.15.so [.] syscall
6.91% bench [kernel.kallsyms] [k] system_call
6.25% bench [kernel.kallsyms] [k] trace_hardirqs_on_caller
6.07% bench [kernel.kallsyms] [k] __secure_computing
5.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
--arm32-- short filter
old BPF: 4.0 sec
39.92% bench [kernel.kallsyms] [k] vector_swi
16.60% bench [kernel.kallsyms] [k] sk_run_filter
14.66% bench libc-2.17.so [.] syscall
5.42% bench [kernel.kallsyms] [k] seccomp_bpf_load
5.10% bench [kernel.kallsyms] [k] __secure_computing
new BPF: 3.7 sec
35.93% bench [kernel.kallsyms] [k] vector_swi
21.89% bench libc-2.17.so [.] syscall
13.45% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
6.25% bench [kernel.kallsyms] [k] __secure_computing
3.96% bench [kernel.kallsyms] [k] syscall_trace_exit
--x86_64-- large filter
old BPF: 8.6 seconds
73.38% bench [kernel.kallsyms] [k] sk_run_filter
10.70% bench libc-2.15.so [.] syscall
5.09% bench [kernel.kallsyms] [k] seccomp_bpf_load
1.97% bench [kernel.kallsyms] [k] system_call
new BPF: 5.7 seconds
66.20% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
16.75% bench libc-2.15.so [.] syscall
3.31% bench [kernel.kallsyms] [k] system_call
2.88% bench [kernel.kallsyms] [k] __secure_computing
--i386-- large filter
old BPF: 5.4 sec
new BPF: 3.8 sec
--arm32-- large filter
old BPF: 13.5 sec
73.88% bench [kernel.kallsyms] [k] sk_run_filter
10.29% bench [kernel.kallsyms] [k] vector_swi
6.46% bench libc-2.17.so [.] syscall
2.94% bench [kernel.kallsyms] [k] seccomp_bpf_load
1.19% bench [kernel.kallsyms] [k] __secure_computing
0.87% bench [kernel.kallsyms] [k] sys_getuid
new BPF: 13.5 sec
76.08% bench [kernel.kallsyms] [k] sk_run_filter_int_seccomp
10.98% bench [kernel.kallsyms] [k] vector_swi
5.87% bench libc-2.17.so [.] syscall
1.77% bench [kernel.kallsyms] [k] __secure_computing
0.93% bench [kernel.kallsyms] [k] sys_getuid
BPF filters generated by seccomp are very branchy, so the new
internal BPF performance is better than the old one. Performance
gains will be even higher when BPF JIT is committed for the
new structure, which is planned in future work (as successive
JIT migrations).
BPF has also been stress-tested with trinity's BPF fuzzer.
Joint work with Daniel Borkmann.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Moore <pmoore@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: linux-kernel@vger.kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are currently pch_gbe, cpts, and ixp4xx_eth drivers that open-code
and reimplement a BPF classifier for the PTP protocol. Since all of them
effectively do the very same thing and load the very same PTP/BPF filter,
we can just consolidate that code by introducing ptp_classify_raw() in
the time-stamping core framework which can be used in drivers.
As drivers get initialized after bootstrapping the core networking
subsystem, they can make use of ptp_insns wrapped through
ptp_classify_raw(), which allows to simplify and remove PTP classifier
setup code in drivers.
Joint work with Alexei Starovoitov.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Richard Cochran <richard.cochran@omicron.at>
Cc: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch migrates an open-coded sk_run_filter() implementation with
proper use of the BPF API, that is, sk_unattached_filter_create(). This
migration is needed, as we will be internally transforming the filter
to a different representation, and therefore needs to be decoupled.
It is okay to do so as skb_timestamping_init() is called during
initialization of the network stack in core initcall via sock_init().
This would effectively also allow for PTP filters to be jit compiled if
bpf_jit_enable is set.
For better readability, there are also some newlines introduced, also
ptp_classify.h is only in kernel space.
Joint work with Alexei Starovoitov.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Richard Cochran <richard.cochran@omicron.at>
Cc: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch basically does two things, i) removes the extern keyword
from the include/linux/filter.h file to be more consistent with the
rest of Joe's changes, and ii) moves filter accounting into the filter
core framework.
Filter accounting mainly done through sk_filter_{un,}charge() take
care of the case when sockets are being cloned through sk_clone_lock()
so that removal of the filter on one socket won't result in eviction
as it's still referenced by the other.
These functions actually belong to net/core/filter.c and not
include/net/sock.h as we want to keep all that in a central place.
It's also not in fast-path so uninlining them is fine and even allows
us to get rd of sk_filter_release_rcu()'s EXPORT_SYMBOL and a forward
declaration.
Joint work with Alexei Starovoitov.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to open up the possibility to internally transform a BPF program
into an alternative and possibly non-trivial reversible representation, we
need to keep the original BPF program around, so that it can be passed back
to user space w/o the need of a complex decoder.
The reason for that use case resides in commit a8fc927780 ("sk-filter:
Add ability to get socket filter program (v2)"), that is, the ability
to retrieve the currently attached BPF filter from a given socket used
mainly by the checkpoint-restore project, for example.
Therefore, we add two helpers sk_{store,release}_orig_filter for taking
care of that. In the sk_unattached_filter_create() case, there's no such
possibility/requirement to retrieve a loaded BPF program. Therefore, we
can spare us the work in that case.
This approach will simplify and slightly speed up both, sk_get_filter()
and sock_diag_put_filterinfo() handlers as we won't need to successively
decode filters anymore through sk_decode_filter(). As we still need
sk_decode_filter() later on, we're keeping it around.
Joint work with Alexei Starovoitov.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a jited flag into sk_filter struct in order to indicate
whether a filter is currently jited or not. The size of sk_filter is
not being expanded as the 32 bit 'len' member allows upper bits to be
reused since a filter can currently only grow as large as BPF_MAXINSNS.
Therefore, there's enough room also for other in future needed flags to
reuse 'len' field if necessary. The jited flag also allows for having
alternative interpreter functions running as currently, we can only
detect jit compiled filters by testing fp->bpf_func to not equal the
address of sk_run_filter().
Joint work with Alexei Starovoitov.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/marvell/mvneta.c
The mvneta.c conflict is a case of overlapping changes,
a conversion to devm_ioremap_resource() vs. a conversion
to netdev_alloc_pcpu_stats.
Signed-off-by: David S. Miller <davem@davemloft.net>
Stop taking the transmit lock when a network device has specified
NETIF_F_LLTX.
If no locks needed to trasnmit a packet this is the ideal scenario for
netpoll as all packets can be trasnmitted immediately.
Even if some locks are needed in ndo_start_xmit skipping any unnecessary
serialization is desirable for netpoll as it makes it more likely a
debugging packet may be trasnmitted immediately instead of being
deferred until later.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the assumption that the skbs that make it to
netpoll_send_skb_on_dev are allocated with find_skb, such that
skb->users == 1 and nothing is attached that would prevent the skbs from
being freed from hard irq context.
Remove this assumption by replacing __kfree_skb on error paths with
dev_kfree_skb_irq (in hard irq context) and kfree_skb (in process
context).
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The netpoll_rx_enable and netpoll_rx_disable functions have always
controlled polling the network drivers transmit and receive queues.
Rename them to netpoll_poll_enable and netpoll_poll_disable to make
their functionality clear.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Today netpoll_rx_enable and netpoll_rx_disable are called from
dev_close and and __dev_close, and not from dev_close_many.
Move the calls into __dev_close_many so that we have a single call
site to maintain, and so that dev_close_many gains this protection as
well. Which importantly makes batched network device deletes safe.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Factor out the code that needs to surround ndo_start_xmit
from netpoll_send_skb_on_dev into netpoll_start_xmit.
It is an unfortunate fact that as the netpoll code has been maintained
the primary call site ndo_start_xmit learned how to handle vlans
and timestamps but the second call of ndo_start_xmit in queue_process
did not.
With the introduction of netpoll_start_xmit this associated logic now
happens at both call sites of ndo_start_xmit and should make it easy
for that to continue into the future.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The gfp parameter was added in:
commit 47be03a28c
Author: Amerigo Wang <amwang@redhat.com>
Date: Fri Aug 10 01:24:37 2012 +0000
netpoll: use GFP_ATOMIC in slave_enable_netpoll() and __netpoll_setup()
slave_enable_netpoll() and __netpoll_setup() may be called
with read_lock() held, so should use GFP_ATOMIC to allocate
memory. Eric suggested to pass gfp flags to __netpoll_setup().
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The reason for the gfp parameter was removed in:
commit c4cdef9b71
Author: dingtianhong <dingtianhong@huawei.com>
Date: Tue Jul 23 15:25:27 2013 +0800
bonding: don't call slave_xxx_netpoll under spinlocks
The slave_xxx_netpoll will call synchronize_rcu_bh(),
so the function may schedule and sleep, it should't be
called under spinlocks.
bond_netpoll_setup() and bond_netpoll_cleanup() are always
protected by rtnl lock, it is no need to take the read lock,
as the slave list couldn't be changed outside rtnl lock.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nothing else that calls __netpoll_setup or ndo_netpoll_setup
requires a gfp paramter, so remove the gfp parameter from both
of these functions making the code clearer.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_network_protocol() already accounts for multiple vlan
headers that may be present in the skb. However, skb_mac_gso_segment()
doesn't know anything about it and assumes that skb->mac_len
is set correctly to skip all mac headers. That may not
always be the case. If we are simply forwarding the packet (via
bridge or macvtap), all vlan headers may not be accounted for.
A simple solution is to allow skb_network_protocol to return
the vlan depth it has calculated. This way skb_mac_gso_segment
will correctly skip all mac headers.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dropping packets in __dev_queue_xmit() when transmit queue
is stopped (NIC TX ring buffer full or BQL limit reached) currently
outputs a syslog message.
It would be better to get a precise count of such events available in
netdevice stats so that monitoring tools can have a clue.
This extends the work done in caf586e5f2
("net: add a core netdev->rx_dropped counter")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_zerocopy can copy elements of the frags array between skbs, but it doesn't
orphan them. Also, it doesn't handle errors, so this patch takes care of that
as well, and modify the callers accordingly. skb_tx_error() is also added to
the callers so they will signal the failed delivery towards the creator of the
skb.
Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch replaces rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL)
The rcu_assign_pointer() ensures that the initialization of a structure
is carried out before storing a pointer to that structure.
And in the case of the NULL pointer, there is no structure to initialize.
So, rcu_assign_pointer(p, NULL) can be safely converted to RCU_INIT_POINTER(p, NULL)
Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The packet hash can be considered a property of the packet, not just
on RX path.
This patch changes name of rxhash and l4_rxhash skbuff fields to be
hash and l4_hash respectively. This includes changing uses of the
field in the code which don't call the access functions.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
Documentation/devicetree/bindings/net/micrel-ks8851.txt
net/core/netpoll.c
The net/core/netpoll.c conflict is a bug fix in 'net' happening
to code which is completely removed in 'net-next'.
In micrel-ks8851.txt we simply have overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Neighbor Solicitation is ipv6 protocol, so we should check
skb->protocol with ETH_P_IPV6
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Cc: WANG Cong <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 3ff661c38c ("net: rtnetlink notify events for FDB NTF_SELF adds and
deletes") reuses the function nlmsg_populate_fdb_fill() to notify fdb events.
But this function was used only for dump and thus was always setting the
flag NLM_F_MULTI, which is wrong in case of a single notification.
Libraries like libnl will wait forever for NLMSG_DONE.
CC: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:
get_online_cpus();
for_each_online_cpu(cpu)
init_cpu(cpu);
register_cpu_notifier(&foobar_cpu_notifier);
put_online_cpus();
This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).
Instead, the correct and race-free way of performing the callback
registration is:
cpu_notifier_register_begin();
for_each_online_cpu(cpu)
init_cpu(cpu);
/* Note the use of the double underscored version of the API */
__register_cpu_notifier(&foobar_cpu_notifier);
cpu_notifier_register_done();
Fix the code in net/core/flow.c by using this latter form of callback
registration.
Cc: Li RongQing <roy.qing.li@gmail.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Ingo Molnar <mingo@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
cftype->write_string() just passes on the writeable buffer from kernfs
and there's no reason to add const restriction on the buffer. The
only thing const achieves is unnecessarily complicating parsing of the
buffer. Drop const from @buffer.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
The netpoll packet receive code only becomes active if the netpoll
rx_skb_hook is implemented, and there is not a single implementation
of the netpoll rx_skb_hook in the kernel.
All of the out of tree implementations I have found all call
netpoll_poll which was removed from the kernel in 2011, so this
change should not add any additional breakage.
There are problems with the netpoll packet receive code. __netpoll_rx
does not call dev_kfree_skb_irq or dev_kfree_skb_any in hard irq
context. netpoll_neigh_reply leaks every skb it receives. Reception
of packets does not work successfully on stacked devices (aka bonding,
team, bridge, and vlans).
Given that the netpoll packet receive code is buggy, there are no
out of tree users that will be merged soon, and the code has
not been used for in tree for a decade let's just remove it.
Reverting this commit can server as a starting point for anyone
who wants to resurrect netpoll packet reception support.
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make rx_skb_hook, and rx in struct netpoll depend on
CONFIG_NETPOLL_TRAP Make rx_lock, rx_np, and neigh_tx in struct
netpoll_info depend on CONFIG_NETPOLL_TRAP
Make the functions netpoll_rx_on, netpoll_rx, and netpoll_receive_skb
no-ops when CONFIG_NETPOLL_TRAP is not set.
Only build netpoll_neigh_reply, checksum_udp service_neigh_queue,
pkt_is_ns, and __netpoll_rx when CONFIG_NETPOLL_TRAP is defined.
Add helper functions netpoll_trap_setup, netpoll_trap_setup_info,
netpoll_trap_cleanup, and netpoll_trap_cleanup_info that initialize
and cleanup the struct netpoll and struct netpoll_info receive
specific fields when CONFIG_NETPOLL_TRAP is enabled and do nothing
otherwise.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the bond slave device neigh_tx handling into service_neigh_queue.
In connection with neigh_tx processing remove unnecessary tests of
a NULL netpoll_info. As the netpoll_poll_dev has already used
and thus verified the existince of the netpoll_info.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we no longer need to receive packets to safely drain the
network drivers receive queue move netpoll_trap and netpoll_set_trap
under CONFIG_NETPOLL_TRAP
Making netpoll_trap and netpoll_set_trap noop inline functions
when CONFIG_NETPOLL_TRAP is not set.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the strategy of netpoll from dropping all packets received
during netpoll_poll_dev to calling napi poll with a budget of 0
(to avoid processing drivers rx queue), and to ignore packets received
with netif_rx (those will safely be placed on the backlog queue).
All of the netpoll supporting drivers have been reviewed to ensure
either thay use netif_rx or that a budget of 0 is supported by their
napi poll routine and that a budget of 0 will not process the drivers
rx queues.
Not dropping packets makes NETPOLL_RX_DROP unnecesary so it is removed.
npinfo->rx_flags is removed as rx_flags with just the NETPOLL_RX_ENABLED
flag becomes just a redundant mirror of list_empty(&npinfo->rx_np).
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper netpoll_rx_processing that reports when netpoll has
receive side processing to perform.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is already a warning for this case in the normal netpoll path,
but put a copy here in case how netpoll calls the poll functions
causes a differenet result.
netpoll will shortly call the napi poll routine with a budget 0 to
avoid any rx packets being processed. As nothing does that today
we may encounter drivers that have problems so a netpoll specific
warning seems desirable.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In poll_napi loop through all of the napi handlers even when the
budget falls to 0 to ensure that we process all of the tx_queues, and
so that we continue to call into drivers when our initial budget is 0.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This moves the control logic to the top level in netpoll_poll_dev
instead of having it dispersed throughout netpoll_poll_dev,
poll_napi and poll_one_napi.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Today netpoll depends on setting NETPOLL_RX_DROP before networking
drivers receive packets in interrupt context so that the packets can
be dropped. Move this setting into netpoll_poll_dev from
poll_one_napi so that if ndo_poll_controller happens to receive
packets we will drop the packets on the floor instead of letting the
packets bounce through the networking stack and potentially cause problems.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/usb/r8152.c
drivers/net/xen-netback/netback.c
Both the r8152 and netback conflicts were simple overlapping
changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
consolidate duplicate code is skb_checksum_setup() helpers
Realizing that the skb_maybe_pull_tail() calls in the IP-protocol
specific portions of both helpers are terminal ones (i.e. no further
pulls are expected), their maximum size to be pulled can be made match
their minimal size needed, thus making the code identical and hence
possible to be moved into another helper.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We leak an active timer, the hotcpu notifier and all allocated
resources when we exit a namespace. Fix this by introducing a
flow_cache_fini() function where we release the resources before
we exit.
Fixes: ca925cf153 ("flowcache: Make flow cache name space aware")
Reported-by: Jakub Kicinski <moorray3@wp.pl>
Tested-by: Jakub Kicinski <moorray3@wp.pl>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The use of __constant_<foo> has been unnecessary for quite awhile now.
Make these uses consistent with the rest of the kernel.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_segment copies frags around, so we need
to copy them carefully to avoid accessing
user memory after reporting completion to userspace
through a callback.
skb_segment doesn't normally happen on datapath:
TSO needs to be disabled - so disabling zero copy
in this case does not look like a big deal.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
fskb is unrelated to frag: it's coming from
frag_list. Rename it list_skb to avoid confusion.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rename local variable to make it easier to tell at a glance that we are
dealing with a head skb.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_frag can in fact point at either skb
or fskb so rename it generally "frag".
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is not legal to create multiple kmem_cache having the same name.
flowcache can use a single kmem_cache, no need for a per netns
one.
Fixes: ca925cf153 ("flowcache: Make flow cache name space aware")
Reported-by: Jakub Kicinski <moorray3@wp.pl>
Tested-by: Jakub Kicinski <moorray3@wp.pl>
Tested-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/wireless/ath/ath9k/recv.c
drivers/net/wireless/mwifiex/pcie.c
net/ipv6/sit.c
The SIT driver conflict consists of a bug fix being done by hand
in 'net' (missing u64_stats_init()) whilst in 'net-next' a helper
was created (netdev_alloc_pcpu_stats()) which takes care of this.
The two wireless conflicts were overlapping changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
If the neigh table's entries is less than gc_thresh1, the function
will return directly, and the reachabletime will not be recompute,
so the reachabletime can be guessed.
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Because those following if conditions will not be matched.
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
iproute2 arpd seems to expect this as there's code and comments
to handle netlink probes with NUD_PROBE set. It is used to flush
the arpd cached mappings.
opennhrp instead turns off unicast probes (so it can handle all
neighbour discovery). Without this change it will not see NUD_PROBE
probes and cannot reconfirm the mapping. Thus currently neigh entry
will just fail and can cause few packets dropped until broadcast
discovery is restarted.
Earlier discussion on the subject:
http://marc.info/?t=139305877100001&r=1&w=2
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a sysfs file to enable user space to query the device
port number used by a netdevice instance. This is needed for
devices that have multiple ports on the same PCI function.
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The documentation misses a few of the supported flags. Fix this. Also
respect the dependency to CONFIG_XFRM for the IPSEC flag.
Cc: Fan Du <fan.du@windriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'out' label is just a relict from previous times as pgctrl_write()
had multiple error paths. Get rid of it and simply return right away
on errors.
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a privileged user writes an empty string to /proc/net/pktgen/pgctrl
the code for stripping the (then non-existent) '\n' actually writes the
zero byte at index -1 of data[]. The then still uninitialized array will
very likely fail the command matching tests and the pr_warning() at the
end will therefore leak stack bytes to the kernel log.
Fix those issues by simply ensuring we're passed a non-empty string as
the user API apparently expects a trailing '\n' for all commands.
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
1) Introduce skb_to_sgvec_nomark function to add further data to the sg list
without calling sg_unmark_end first. Needed to add extended sequence
number informations. From Fan Du.
2) Add IPsec extended sequence numbers support to the Authentication Header
protocol for ipv4 and ipv6. From Fan Du.
3) Make the IPsec flowcache namespace aware, from Fan Du.
4) Avoid creating temporary SA for every packet when no key manager is
registered. From Horia Geanta.
5) Support filtering of SA dumps to show only the SAs that match a
given filter. From Nicolas Dichtel.
6) Remove caching of xfrm_policy_sk_bundles. The cached socket policy bundles
are never used, instead we create a new cache entry whenever xfrm_lookup()
is called on a socket policy. Most protocols cache the used routes to the
socket, so this caching is not needed.
7) Fix a forgotten SADB_X_EXT_FILTER length check in pfkey, from Nicolas
Dichtel.
8) Cleanup error handling of xfrm_state_clone.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The name __smp_call_function_single() doesn't tell much about the
properties of this function, especially when compared to
smp_call_function_single().
The comments above the implementation are also misleading. The main
point of this function is actually not to be able to embed the csd
in an object. This is actually a requirement that result from the
purpose of this function which is to raise an IPI asynchronously.
As such it can be called with interrupts disabled. And this feature
comes at the cost of the caller who then needs to serialize the
IPIs on this csd.
Lets rename the function and enhance the comments so that they reflect
these properties.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The main point of calling __smp_call_function_single() is to send
an IPI in a pure asynchronous way. By embedding a csd in an object,
a caller can send the IPI without waiting for a previous one to complete
as is required by smp_call_function_single() for example. As such,
sending this kind of IPI can be safe even when irqs are disabled.
This flexibility comes at the expense of the caller who then needs to
synchronize the csd lifecycle by himself and make sure that IPIs on a
single csd are serialized.
This is how __smp_call_function_single() works when wait = 0 and this
usecase is relevant.
Now there don't seem to be any usecase with wait = 1 that can't be
covered by smp_call_function_single() instead, which is safer. Lets look
at the two possible scenario:
1) The user calls __smp_call_function_single(wait = 1) on a csd embedded
in an object. It looks like a nice and convenient pattern at the first
sight because we can then retrieve the object from the IPI handler easily.
But actually it is a waste of memory space in the object since the csd
can be allocated from the stack by smp_call_function_single(wait = 1)
and the object can be passed an the IPI argument.
Besides that, embedding the csd in an object is more error prone
because the caller must take care of the serialization of the IPIs
for this csd.
2) The user calls __smp_call_function_single(wait = 1) on a csd that
is allocated on the stack. It's ok but smp_call_function_single()
can do it as well and it already takes care of the allocation on the
stack. Again it's more simple and less error prone.
Therefore, using the underscore prepend API version with wait = 1
is a bad pattern and a sign that the caller can do safer and more
simple.
There was a single user of that which has just been converted.
So lets remove this option to discourage further users.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch fixes bug introduced by:
commit 1d4c8c2984
"neigh: restore old behaviour of default parms values"
The thing is that in neigh_sysctl_register, extra1 and extra2 which were
previously set for NEIGH_VAR_GC_* are overwritten. That leads to
nonsense int limits for gc_* variables. So fix this by not touching
extra* fields for gc_* variables.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree,
they are:
* Fix nf_trace in nftables if XT_TRACE=n, from Florian Westphal.
* Don't use the fast payload operation in nf_tables if the length is
not power of 2 or it is not aligned, from Nikolay Aleksandrov.
* Fix missing break statement the inet flavour of nft_reject, which
results in evaluating IPv4 packets with the IPv6 evaluation routine,
from Patrick McHardy.
* Fix wrong kconfig symbol in nft_meta to match the routing realm,
from Paul Bolle.
* Allocate the NAT null binding when creating new conntracks via
ctnetlink to avoid that several packets race at initializing the
the conntrack NAT extension, original patch from Florian Westphal,
revisited version from me.
* Fix DNAT handling in the snmp NAT helper, the same handling was being
done for SNAT and DNAT and 2.4 already contains that fix, from
Francois-Xavier Le Bail.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fix spelling typo in Documentation/DocBook.
It is because .html and .xml files are generated by make htmldocs,
I have to fix a typo within the source files.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Conflicts:
drivers/net/bonding/bond_3ad.h
drivers/net/bonding/bond_main.c
Two minor conflicts in bonding, both of which were overlapping
changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
The only place this is used outside rtnetlink.c is veth. So provide
wrapper function for this usage.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using nftables with CONFIG_NETFILTER_XT_TARGET_TRACE=n, we get
lots of "TRACE: filter:output:policy:1 IN=..." warnings as several
places will leave skb->nf_trace uninitialised.
Unlike iptables tracing functionality is not conditional in nftables,
so always copy/zero nf_trace setting when nftables is enabled.
Move this into __nf_copy() helper.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In order to allow users to invoke netdev_cap_txqueue, it needs to
be moved into netdevice.h header file. While at it, also add kernel
doc header to document the API.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new argument for ndo_select_queue() callback that passes a
fallback handler. This gets invoked through netdev_pick_tx();
fallback handler is currently __netdev_pick_tx() as most drivers
invoke this function within their customized implementation in
case for skbs that don't need any special handling. This fallback
handler can then be replaced on other call-sites with different
queue selection methods (e.g. in packet sockets, pktgen etc).
This also has the nice side-effect that __netdev_pick_tx() is
then only invoked from netdev_pick_tx() and export of that
function to modules can be undone.
Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove one inline keyword, and no need for a loop to find
an index into a table.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
One of my pet coding style peeves is the practice of
adding extra return; at the end of function.
Kill several instances of this in network code.
I suppose some coccinelle wizardy could do this automatically.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Will be used by upcoming ipv4 forward path change that needs to
determine feature mask using skb->dst->dev instead of skb->dev.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
I saw the following BUG when ->newlink() fails in rtnl_newlink():
[ 40.240058] kernel BUG at net/core/dev.c:6438!
this is due to free_netdev() is not supposed to be called before
netdev is completely unregistered, therefore it is not correct
to call free_netdev() here, at least for ops->newlink!=NULL case,
many drivers call it in ->destructor so that rtnl_unlock() will
take care of it, we probably don't need to do anything here.
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If !NULL, @skip_css makes cgroup_taskset_for_each() skip the matching
css. The intention of the interface is to make it easy to skip css's
(cgroup_subsys_states) which already match the migration target;
however, this is entirely unnecessary as migration taskset doesn't
include tasks which are already in the target cgroup. Drop @skip_css
from cgroup_taskset_for_each().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Daniel Borkmann <dborkman@redhat.com>
Inserting a entry into flowcache, or flushing flowcache should be based
on per net scope. The reason to do so is flushing operation from fat
netns crammed with flow entries will also making the slim netns with only
a few flow cache entries go away in original implementation.
Since flowcache is tightly coupled with IPsec, so it would be easier to
put flow cache global parameters into xfrm namespace part. And one last
thing needs to do is bumping flow cache genid, and flush flow cache should
also be made in per net style.
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
As compared with skb_to_sgvec, skb_to_sgvec_nomark only map skb to given sglist
without mark the sg which contain last skb data as the end. So the caller can
mannipulate sg list as will when padding new data after the first call without
calling sg_unmark_end to expend sg list.
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
ip rules with iif/oif references do not update:
(detach/attach) across interface renames.
Signed-off-by: Maciej Żenczykowski <maze@google.com>
CC: Willem de Bruijn <willemb@google.com>
CC: Eric Dumazet <edumazet@google.com>
CC: Chris Davis <chrismd@google.com>
CC: Carlo Contavalli <ccontavalli@google.com>
Google-Bug-Id: 12936021
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mark functions as static in core/dev.c because they are not used outside
this file.
This eliminates the following warning in core/dev.c:
net/core/dev.c:2806:5: warning: no previous prototype for ‘__dev_queue_xmit’ [-Wmissing-prototypes]
net/core/dev.c:4640:5: warning: no previous prototype for ‘netdev_adjacent_sysfs_add’ [-Wmissing-prototypes]
net/core/dev.c:4650:6: warning: no previous prototype for ‘netdev_adjacent_sysfs_del’ [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
cgroup_subsys is a bit messier than it needs to be.
* The name of a subsys can be different from its internal identifier
defined in cgroup_subsys.h. Most subsystems use the matching name
but three - cpu, memory and perf_event - use different ones.
* cgroup_subsys_id enums are postfixed with _subsys_id and each
cgroup_subsys is postfixed with _subsys. cgroup.h is widely
included throughout various subsystems, it doesn't and shouldn't
have claim on such generic names which don't have any qualifier
indicating that they belong to cgroup.
* cgroup_subsys->subsys_id should always equal the matching
cgroup_subsys_id enum; however, we require each controller to
initialize it and then BUG if they don't match, which is a bit
silly.
This patch cleans up cgroup_subsys names and initialization by doing
the followings.
* cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
cgroup_subsys with _cgrp_subsys.
* With the above, renaming subsys identifiers to match the userland
visible names doesn't cause any naming conflicts. All non-matching
identifiers are renamed to match the official names.
cpu_cgroup -> cpu
mem_cgroup -> memory
perf -> perf_event
* controllers no longer need to initialize ->subsys_id and ->name.
They're generated in cgroup core and set automatically during boot.
* Redundant cgroup_subsys declarations removed.
* While updating BUG_ON()s in cgroup_init_early(), convert them to
WARN()s. BUGging that early during boot is stupid - the kernel
can't print anything, even through serial console and the trap
handler doesn't even link stack frame properly for back-tracing.
This patch doesn't introduce any behavior changes.
v2: Rebased on top of fe1217c4f3 ("net: net_cls: move cgroupfs
classid handling into core").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: "David S. Miller" <davem@davemloft.net>
Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Ingo Molnar <mingo@redhat.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Thomas Graf <tgraf@suug.ch>
With module supported dropped from net_prio, no controller is using
cgroup module support. None of actual resource controllers can be
built as a module and we aren't gonna add new controllers which don't
control resources. This patch drops module support from cgroup.
* cgroup_[un]load_subsys() and cgroup_subsys->module removed.
* As there's no point in distinguishing IS_BUILTIN() and IS_MODULE(),
cgroup_subsys.h now uses IS_ENABLED() directly.
* enum cgroup_subsys_id now exactly matches the list of enabled
controllers as ordered in cgroup_subsys.h.
* cgroup_subsys[] is now a contiguously occupied array. Size
specification is no longer necessary and dropped.
* for_each_builtin_subsys() is removed and for_each_subsys() is
updated to not require any locking.
* module ref handling is removed from rebind_subsystems().
* Module related comments dropped.
v2: Rebased on top of fe1217c4f3 ("net: net_cls: move cgroupfs
classid handling into core").
v3: Added {} around the if (need_forkexit_callback) block in
cgroup_post_fork() for readability as suggested by Li.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
net_prio is the only cgroup which is allowed to be built as a module.
The savings from allowing one controller to be built as a module are
tiny especially given that cgroup module support itself adds quite a
bit of complexity.
Given that none of other controllers has much chance of being made a
module and that we're unlikely to add new modular controllers, the
added complexity is simply not justifiable.
As a first step to drop cgroup module support, this patch changes the
config option to bool from tristate and drops module related code from
it.
Also, while an earlier commit fe1217c4f3 ("net: net_cls: move
cgroupfs classid handling into core") dropped module support from
net_cls cgroup, it retained a call to cgroup_load_subsys(), which is
noop for built-in controllers. Drop it along with
init_netclassid_cgroup().
v2: Removed modular version of task_netprioidx() in
include/net/netprio_cgroup.h as suggested by Li Zefan.
v3: Rebased on top of fe1217c4f3 ("net: net_cls: move cgroupfs
classid handling into core"). net_cls cgroup part is mostly
dropped except for removal of init_netclassid_cgroup().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: "David S. Miller" <davem@davemloft.net>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Thomas Graf <tgraf@suug.ch>
sock_alloc_send_pskb() & sk_page_frag_refill()
have a loop trying high order allocations to prepare
skb with low number of fragments as this increases performance.
Problem is that under memory pressure/fragmentation, this can
trigger OOM while the intent was only to try the high order
allocations, then fallback to order-0 allocations.
We had various reports from unexpected regressions.
According to David, setting __GFP_NORETRY should be fine,
as the asynchronous compaction is still enabled, and this
will prevent OOM from kicking as in :
CFSClientEventm invoked oom-killer: gfp_mask=0x42d0, order=3, oom_adj=0,
oom_score_adj=0, oom_score_badness=2 (enabled),memcg_scoring=disabled
CFSClientEventm
Call Trace:
[<ffffffff8043766c>] dump_header+0xe1/0x23e
[<ffffffff80437a02>] oom_kill_process+0x6a/0x323
[<ffffffff80438443>] out_of_memory+0x4b3/0x50d
[<ffffffff8043a4a6>] __alloc_pages_may_oom+0xa2/0xc7
[<ffffffff80236f42>] __alloc_pages_nodemask+0x1002/0x17f0
[<ffffffff8024bd23>] alloc_pages_current+0x103/0x2b0
[<ffffffff8028567f>] sk_page_frag_refill+0x8f/0x160
[<ffffffff80295fa0>] tcp_sendmsg+0x560/0xee0
[<ffffffff802a5037>] inet_sendmsg+0x67/0x100
[<ffffffff80283c9c>] __sock_sendmsg_nosec+0x6c/0x90
[<ffffffff80283e85>] sock_sendmsg+0xc5/0xf0
[<ffffffff802847b6>] __sys_sendmsg+0x136/0x430
[<ffffffff80284ec8>] sys_sendmsg+0x88/0x110
[<ffffffff80711472>] system_call_fastpath+0x16/0x1b
Out of Memory: Kill process 2856 (bash) score 9999 or sacrifice child
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, to make netconsole start over IPv6, the source address
needs to be specified. Without a source address, netpoll_parse_options
assumes we're setting up over IPv4 and the destination IPv6 address is
rejected.
Check if the IP version has been forced by a source address before
checking for a version mismatch when parsing the destination address.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>