Zerocopy payload is now always stored in frags, and space for headers
is reversed, so this memory is unused.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This avoids an unnecessary copy of 1-2KB and improves tso_fragment,
which has to fall back to tcp_fragment if skb->len != skb_data_len.
It also avoids a surprising inconsistency in notifications:
Zerocopy packets sent over loopback have their frags copied, so set
SO_EE_CODE_ZEROCOPY_COPIED in the notification. But this currently
does not happen for small packets, because when all data fits in the
linear fragment, data is not copied in skb_orphan_frags_rx.
Reported-by: Tom Deseyn <tom.deseyn@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Skbs that reach MAX_SKB_FRAGS cannot be extended further. Do the
same for zerocopy frags as non-zerocopy frags and set the PSH bit.
This improves GRO assembly.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a net-next follow-up to commit 268b790679 ("skbuff: orphan
frags before zerocopy clone"), which fixed a bug in net, but added a
call to skb_zerocopy_clone at each frag to do so.
When segmenting skbs with user frags, either the user frags must be
replaced with private copies and uarg released, or the uarg must have
its refcount increased for each new skb.
skb_orphan_frags does the first, except for cases that can handle
reference counting. skb_zerocopy_clone then does the second.
Call these once per nskb, instead of once per frag.
That is, in the common case. With a frag list, also refresh when the
origin skb (frag_skb) changes.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The patch(180d8cd942) replaces all uses of struct sock fields'
memory_pressure, memory_allocated, sockets_allocated, and sysctl_mem
to accessor macros. But the sockets_allocated field of sctp sock is
not replaced at all. Then replace it now for unifying the code.
Fixes: 180d8cd942 ("foundations of per-cgroup memory pressure controlling.")
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If kmem_cache_alloc() fails in the middle of the for() loop,
cleanup anything that might have been allocated so far.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit f10b4cff98 ("rds: tcp: atomically purge entries from
rds_tcp_conn_list during netns delete") adds the field t_tcp_detached,
but this needs to be initialized explicitly to false.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the rds_sock is not added to the bind_hash_table, we must
reset rs_bound_addr so that rds_remove_bound will not trip on
this rds_sock.
rds_add_bound() does a rds_sock_put() in this failure path, so
failing to reset rs_bound_addr will result in a socket refcount
bug, and will trigger a WARN_ON with the stack shown below when
the application subsequently tries to close the PF_RDS socket.
WARNING: CPU: 20 PID: 19499 at net/rds/af_rds.c:496 \
rds_sock_destruct+0x15/0x30 [rds]
:
__sk_destruct+0x21/0x190
rds_remove_bound.part.13+0xb6/0x140 [rds]
rds_release+0x71/0x120 [rds]
sock_release+0x1a/0x70
sock_close+0xe/0x20
__fput+0xd5/0x210
task_work_run+0x82/0xa0
do_exit+0x2ce/0xb30
? syscall_trace_enter+0x1cc/0x2b0
do_group_exit+0x39/0xa0
SyS_exit_group+0x10/0x10
do_syscall_64+0x61/0x1a0
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce peer_offset parameter in order to add the capability
to specify two different values for payload offset on tx/rx side.
If just offset is provided by userspace use it for rx side as well
in order to maintain compatibility with older l2tp versions
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Report offset parameter in L2TP_CMD_SESSION_GET command if
it has been configured by userspace
Fixes: 309795f4be ("l2tp: Add netlink control API for L2TP")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2017-12-22
1) Separate ESP handling from segmentation for GRO packets.
This unifies the IPsec GSO and non GSO codepath.
2) Add asynchronous callbacks for xfrm on layer 2. This
adds the necessary infrastructure to core networking.
3) Allow to use the layer2 IPsec GSO codepath for software
crypto, all infrastructure is there now.
4) Also allow IPsec GSO with software crypto for local sockets.
5) Don't require synchronous crypto fallback on IPsec offloading,
it is not needed anymore.
6) Check for xdo_dev_state_free and only call it if implemented.
From Shannon Nelson.
7) Check for the required add and delete functions when a driver
registers xdo_dev_ops. From Shannon Nelson.
8) Define xfrmdev_ops only with offload config.
From Shannon Nelson.
9) Update the xfrm stats documentation.
From Shannon Nelson.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net): ipsec 2017-12-22
1) Check for valid id proto in validate_tmpl(), otherwise
we may trigger a warning in xfrm_state_fini().
From Cong Wang.
2) Fix a typo on XFRMA_OUTPUT_MARK policy attribute.
From Michal Kubecek.
3) Verify the state is valid when encap_type < 0,
otherwise we may crash on IPsec GRO .
From Aviv Heller.
4) Fix stack-out-of-bounds read on socket policy lookup.
We access the flowi of the wrong address family in the
IPv4 mapped IPv6 case, fix this by catching address
family missmatches before we do the lookup.
5) fix xfrm_do_migrate() with AEAD to copy the geniv
field too. Otherwise the state is not fully initialized
and migration fails. From Antony Antony.
6) Fix stack-out-of-bounds with misconfigured transport
mode policies. Our policy template validation is not
strict enough. It is possible to configure policies
with transport mode template where the address family
of the template does not match the selectors address
family. Fix this by refusing such a configuration,
address family can not change on transport mode.
7) Fix a policy reference leak when reusing pcpu xdst
entry. From Florian Westphal.
8) Reinject transport-mode packets through tasklet,
otherwise it is possible to reate a recursion
loop. From Herbert Xu.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix memory leak in tipc_enable_bearer() if enable_media() fails, and
cleanup with bearer_disable() if tipc_mon_create() fails.
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RDS currently doesn't check if the length of the control message is
large enough to hold the required data, before dereferencing the control
message data. This results in following crash:
BUG: KASAN: stack-out-of-bounds in rds_rdma_bytes net/rds/send.c:1013
[inline]
BUG: KASAN: stack-out-of-bounds in rds_sendmsg+0x1f02/0x1f90
net/rds/send.c:1066
Read of size 8 at addr ffff8801c928fb70 by task syzkaller455006/3157
CPU: 0 PID: 3157 Comm: syzkaller455006 Not tainted 4.15.0-rc3+ #161
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0x194/0x257 lib/dump_stack.c:53
print_address_description+0x73/0x250 mm/kasan/report.c:252
kasan_report_error mm/kasan/report.c:351 [inline]
kasan_report+0x25b/0x340 mm/kasan/report.c:409
__asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
rds_rdma_bytes net/rds/send.c:1013 [inline]
rds_sendmsg+0x1f02/0x1f90 net/rds/send.c:1066
sock_sendmsg_nosec net/socket.c:628 [inline]
sock_sendmsg+0xca/0x110 net/socket.c:638
___sys_sendmsg+0x320/0x8b0 net/socket.c:2018
__sys_sendmmsg+0x1ee/0x620 net/socket.c:2108
SYSC_sendmmsg net/socket.c:2139 [inline]
SyS_sendmmsg+0x35/0x60 net/socket.c:2134
entry_SYSCALL_64_fastpath+0x1f/0x96
RIP: 0033:0x43fe49
RSP: 002b:00007fffbe244ad8 EFLAGS: 00000217 ORIG_RAX: 0000000000000133
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043fe49
RDX: 0000000000000001 RSI: 000000002020c000 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000217 R12: 00000000004017b0
R13: 0000000000401840 R14: 0000000000000000 R15: 0000000000000000
To fix this, we verify that the cmsg_len is large enough to hold the
data to be read, before proceeding further.
Reported-by: syzbot <syzkaller-bugs@googlegroups.com>
Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The 'md' is allocated from 'tun_dst = ip_tun_rx_dst' and
since we've checked 'tun_dst', 'md' will never be NULL.
The patch removes it at both ipv4 and ipv6 erspan.
Fixes: afb4c97d90 ("ip6_gre: fix potential memory leak in ip6erspan_rcv")
Fixes: 50670b6ee9 ("ip_gre: fix potential memory leak in erspan_rcv")
Cc: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dereference tp->md5sig_info in tcp_v4_destroy_sock() the same way it is
done in the adjacent call to tcp_clear_md5_list().
Resolves this sparse warning:
net/ipv4/tcp_ipv4.c:1914:17: warning: incorrect type in argument 1 (different address spaces)
net/ipv4/tcp_ipv4.c:1914:17: expected struct callback_head *head
net/ipv4/tcp_ipv4.c:1914:17: got struct callback_head [noderef] <asn:4>*<noident>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Acked-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If SNAT modifies the source address the resulting packet might match
an IPsec policy, reinject the packet if that's the case.
The exact same thing is already done for IPv4.
Signed-off-by: Tobias Brunner <tobias@strongswan.org>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a group member receives a member WITHDRAW event, this might have
two reasons: either the peer member is leaving the group, or the link
to the member's node has been lost.
In the latter case we need to issue a DOWN event to the user right away,
and let function tipc_group_filter_msg() perform delete of the member
item. However, in this case we miss to change the state of the member
item to MBR_LEAVING, so the member item is not deleted, and we have a
memory leak.
We now separate better between the four sub-cases of a WITHRAW event
and make sure that each case is handled correctly.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to check block for being null in both tcf_block_put and
tcf_block_put_ext.
Fixes: 343723dd51 ("net: sched: fix clsact init error path")
Reported-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 2f487712b8 ("tipc: guarantee that group broadcast doesn't
bypass group unicast") we introduced a mechanism that requires the first
(replicated) broadcast sent after a unicast to be acknowledged by all
receivers before permitting sending of the next (true) broadcast.
The counter for keeping track of the number of acknowledges to expect
is based on the tipc_group::member_cnt variable. But this misses that
some of the known members may not be ready for reception, and will never
acknowledge the message, either because they haven't fully joined the
group or because they are leaving the group. Such members are identified
by not fulfilling the condition tested for in the function
tipc_group_is_enabled().
We now set the counter for the actual number of acks to receive at the
moment the message is sent, by just counting the number of recipients
satisfying the tipc_group_is_enabled() test.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The rcu_barrier_bh() in mini_qdisc_pair_swap() is to wait for
flying RCU callback installed by a previous mini_qdisc_pair_swap(),
however we miss it on the tp_head==NULL path, which leads to that
the RCU callback still uses miniq_old->rcu after it is freed together
with qdisc in qdisc_graft(). So just add it on that path too.
Fixes: 46209401f8 ("net: core: introduce mini_Qdisc and eliminate usage of tp->q for clsact fastpath ")
Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Tested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When ip6gre is created using ioctl, its features, such as
scatter-gather, GSO and tx-checksumming will be turned off:
# ip -f inet6 tunnel add gre6 mode ip6gre remote fd00::1
# ethtool -k gre6 (truncated output)
tx-checksumming: off
scatter-gather: off
tcp-segmentation-offload: off
generic-segmentation-offload: off [requested on]
But when netlink is used, they will be enabled:
# ip link add gre6 type ip6gre remote fd00::1
# ethtool -k gre6 (truncated output)
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
generic-segmentation-offload: on
This results in a loss of performance when gre6 is created via ioctl.
The issue was found with LTP/gre tests.
Fix it by moving the setup of device features to a separate function
and invoke it with ndo_init callback because both netlink and ioctl
will eventually call it via register_netdevice():
register_netdevice()
- ndo_init() callback -> ip6gre_tunnel_init() or ip6gre_tap_init()
- ip6gre_tunnel_init_common()
- ip6gre_tnl_init_features()
The moved code also contains two minor style fixes:
* removed needless tab from GRE6_FEATURES on NETIF_F_HIGHDMA line.
* fixed the issue reported by checkpatch: "Unnecessary parentheses around
'nt->encap.type == TUNNEL_ENCAP_NONE'"
Fixes: ac4eb009e4 ("ip6gre: Add support for basic offloads offloads excluding GSO")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lots of overlapping changes. Also on the net-next side
the XDP state management is handled more in the generic
layers so undo the 'net' nfp fix which isn't applicable
in net-next.
Include a necessary change by Jakub Kicinski, with log message:
====================
cls_bpf no longer takes care of offload tracking. Make sure
netdevsim performs necessary checks. This fixes a warning
caused by TC trying to remove a filter it has not added.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The batman-adv unicast packets contain a full layer 2 frame in encapsulated
form. The flow dissector must therefore be able to parse the batman-adv
unicast header to reach the layer 2+3 information.
+--------------------+
| ip(v6)hdr |
+--------------------+
| inner ethhdr |
+--------------------+
| batadv unicast hdr |
+--------------------+
| outer ethhdr |
+--------------------+
The obtained information from the upper layer can then be used by RPS to
schedule the processing on separate cores. This allows better distribution
of multiple flows from the same neighbor to different cores.
Signed-off-by: Sven Eckelmann <sven.eckelmann@openmesh.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The header file is used by different userspace programs to inject packets
or to decode sniffed packets. It should therefore be available to them as
userspace header.
Also other components in the kernel (like the flow dissector) require
access to the packet definitions to be able to decode ETH_P_BATMAN ethernet
packets.
Signed-off-by: Sven Eckelmann <sven.eckelmann@openmesh.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The uapi headers use the __u8/__u16/... version of the fixed width types
instead of u8/u16/... The use of the latter must be avoided before
packet.h is copied to include/uapi/linux/.
Signed-off-by: Sven Eckelmann <sven.eckelmann@openmesh.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The BIT(x) macro is no longer available for uapi headers because it is
defined outside of it (linux/bitops.h). The use of it must therefore be
avoided and replaced by an appropriate other representation.
Signed-off-by: Sven Eckelmann <sven.eckelmann@openmesh.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The headers used by packet.h should also be included by it directly. main.h
is currently dealing with it in batman-adv, but this will no longer work
when this header is moved to include/uapi/linux/.
Signed-off-by: Sven Eckelmann <sven.eckelmann@openmesh.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_copy_ubufs creates a private copy of frags[] to release its hold
on user frags, then calls uarg->callback to notify the owner.
Call uarg->callback even when no frags exist. This edge case can
happen when zerocopy_sg_from_iter finds enough room in skb_headlen
to copy all the data.
Fixes: 3ece782693 ("sock: skb_copy_ubufs support for compound pages")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Call skb_zerocopy_clone after skb_orphan_frags, to avoid duplicate
calls to skb_uarg(skb)->callback for the same data.
skb_zerocopy_clone associates skb_shinfo(skb)->uarg from frag_skb
with each segment. This is only safe for uargs that do refcounting,
which is those that pass skb_orphan_frags without dropping their
shared frags. For others, skb_orphan_frags drops the user frags and
sets the uarg to NULL, after which sock_zerocopy_clone has no effect.
Qemu hangs were reported due to duplicate vhost_net_zerocopy_callback
calls for the same data causing the vhost_net_ubuf_ref_>refcount to
drop below zero.
Link: http://lkml.kernel.org/r/<CAF=yD-LWyCD4Y0aJ9O0e_CHLR+3JOeKicRRTEVCPxgw4XOcqGQ@mail.gmail.com>
Fixes: 1f8b977ab3 ("sock: enable MSG_ZEROCOPY")
Reported-by: Andreas Hartmann <andihartmann@01019freenet.de>
Reported-by: David Hill <dhill@redhat.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sysctl.ip6.auto_flowlabels is default 1. In our hosts, we set it to 2.
If sockopt doesn't set autoflowlabel, outcome packets from the hosts are
supposed to not include flowlabel. This is true for normal packet, but
not for reset packet.
The reason is ipv6_pinfo.autoflowlabel is set in sock creation. Later if
we change sysctl.ip6.auto_flowlabels, the ipv6_pinfo.autoflowlabel isn't
changed, so the sock will keep the old behavior in terms of auto
flowlabel. Reset packet is suffering from this problem, because reset
packet is sent from a special control socket, which is created at boot
time. Since sysctl.ipv6.auto_flowlabels is 1 by default, the control
socket will always have its ipv6_pinfo.autoflowlabel set, even after
user set sysctl.ipv6.auto_flowlabels to 1, so reset packset will always
have flowlabel. Normal sock created before sysctl setting suffers from
the same issue. We can't even turn off autoflowlabel unless we kill all
socks in the hosts.
To fix this, if IPV6_AUTOFLOWLABEL sockopt is used, we use the
autoflowlabel setting from user, otherwise we always call
ip6_default_np_autolabel() which has the new settings of sysctl.
Note, this changes behavior a little bit. Before commit 42240901f7
(ipv6: Implement different admin modes for automatic flow labels), the
autoflowlabel behavior of a sock isn't sticky, eg, if sysctl changes,
existing connection will change autoflowlabel behavior. After that
commit, autoflowlabel behavior is sticky in the whole life of the sock.
With this patch, the behavior isn't sticky again.
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Tom Herbert <tom@quantonium.net>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb_vlan_pop() expects skb->protocol to be a valid TPID for double
tagged frames. So set skb->protocol to the TPID and let skb_vlan_pop()
shift the true ethertype into position for us.
Fixes: 5108bbaddc ("openvswitch: add processing of L3 packets")
Signed-off-by: Eric Garver <e@erig.me>
Reviewed-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds extack support for the drr qdisc implementation by
adding NL_SET_ERR_MSG in validation of user input.
Also it serves to illustrate a use case of how the infrastructure ops
api changes are to be used by individual qdiscs.
Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds extack support for the cbs qdisc implementation by
adding NL_SET_ERR_MSG in validation of user input.
Also it serves to illustrate a use case of how the infrastructure ops
api changes are to be used by individual qdiscs.
Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds extack support for the cbq qdisc implementation by
adding NL_SET_ERR_MSG in validation of user input.
Also it serves to illustrate a use case of how the infrastructure ops
api changes are to be used by individual qdiscs.
Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds extack support for the function qdisc_create_dflt which is
a common used function in the tc subsystem. Callers which are interested
in the receiving error can assign extack to get a more detailed
information why qdisc_create_dflt failed. The function qdisc_create_dflt
will also call an init callback which can fail by any per-qdisc specific
handling.
Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds extack support for the function qdisc_alloc which is
a common used function in the tc subsystem. Callers which are interested
in the receiving error can assign extack to get a more detailed
information why qdisc_alloc failed.
Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds extack support for the function tcf_block_get which is
a common used function in the tc subsystem. Callers which are interested
in the receiving error can assign extack to get a more detailed
information why tcf_block_get failed.
Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds extack support for the function qdisc_get_rtab which is
a common used function in the tc subsystem. Callers which are interested
in the receiving error can assign extack to get a more detailed
information why qdisc_get_rtab failed.
Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds extack support for graft callback to prepare per-qdisc
specific changes for extack.
Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds extack support for block callback to prepare per-qdisc
specific changes for extack.
Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds extack support for class change callback api. This prepares
to handle extack support inside each specific class implementation.
Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds extack support for change callback for qdisc ops
structtur to prepare per-qdisc specific changes for extack.
Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds extack support for init callback to prepare per-qdisc
specific changes for extack.
Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds extack support for generic qdisc handling. The extack
will be set deeper to each called function which is not part of netdev
core api.
Cc: David Ahern <dsahern@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fix checkpatch issues for upcomming patches according to the
sched api file. It changes mostly how to check on null pointer.
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, parameters such as oif and source address are not taken into
account during fibmatch lookup. Example (IPv4 for reference) before
patch:
$ ip -4 route show
192.0.2.0/24 dev dummy0 proto kernel scope link src 192.0.2.1
198.51.100.0/24 dev dummy1 proto kernel scope link src 198.51.100.1
$ ip -6 route show
2001:db8:1::/64 dev dummy0 proto kernel metric 256 pref medium
2001:db8:2::/64 dev dummy1 proto kernel metric 256 pref medium
fe80::/64 dev dummy0 proto kernel metric 256 pref medium
fe80::/64 dev dummy1 proto kernel metric 256 pref medium
$ ip -4 route get fibmatch 192.0.2.2 oif dummy0
192.0.2.0/24 dev dummy0 proto kernel scope link src 192.0.2.1
$ ip -4 route get fibmatch 192.0.2.2 oif dummy1
RTNETLINK answers: No route to host
$ ip -6 route get fibmatch 2001:db8:1::2 oif dummy0
2001:db8:1::/64 dev dummy0 proto kernel metric 256 pref medium
$ ip -6 route get fibmatch 2001:db8:1::2 oif dummy1
2001:db8:1::/64 dev dummy0 proto kernel metric 256 pref medium
After:
$ ip -6 route get fibmatch 2001:db8:1::2 oif dummy0
2001:db8:1::/64 dev dummy0 proto kernel metric 256 pref medium
$ ip -6 route get fibmatch 2001:db8:1::2 oif dummy1
RTNETLINK answers: Network is unreachable
The problem stems from the fact that the necessary route lookup flags
are not set based on these parameters.
Instead of duplicating the same logic for fibmatch, we can simply
resolve the original route from its copy and dump it instead.
Fixes: 18c3a61c42 ("net: ipv6: RTM_GETROUTE: return matched fib result when requested")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds a check for the required add and delete functions up front
at registration time to be sure both are defined.
Since both the features check and the registration check are looking
at the same things, break out the check for both to call.
Lastly, for some reason the feature check was setting xfrmdev_ops to
NULL if the NETIF_F_HW_ESP bit was missing, which would probably
surprise the driver later if the driver turned its NETIF_F_HW_ESP bit
back on. We shouldn't be messing with the driver's callback list, so
we stop doing that with this patch.
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Since commit 0ddcf43d5d ("ipv4: FIB Local/MAIN table collapse") the
local table uses the same trie allocated for the main table when custom
rules are not in use.
When a net namespace is dismantled, the main table is flushed and freed
(via an RCU callback) before the local table. In case the callback is
invoked before the local table is iterated, a use-after-free can occur.
Fix this by iterating over the FIB tables in reverse order, so that the
main table is always freed after the local table.
v3: Reworded comment according to Alex's suggestion.
v2: Add a comment to make the fix more explicit per Dave's and Alex's
feedback.
Fixes: 0ddcf43d5d ("ipv4: FIB Local/MAIN table collapse")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we receive a JOIN message from a peer member, the message may
contain an advertised window value ADV_IDLE that permits removing the
member in question from the tipc_group::congested list. However, since
the removal has been made conditional on that the advertised window is
*not* ADV_IDLE, we miss this case. This has the effect that a sender
sometimes may enter a state of permanent, false, broadcast congestion.
We fix this by unconditinally removing the member from the congested
list before calling tipc_member_update(), which might potentially sort
it into the list again.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- bump version strings, by Simon Wunderlich
- de-inline hash functions to save memory footprint, by Denys Vlasenko
- Add License information to various files, by Sven Eckelmann (3 patches)
- Change batman_adv.h from ISC to MIT, by Sven Eckelmann
- Improve various includes, by Sven Eckelmann (5 patches)
- Lots of kernel-doc work by Sven Eckelmann (8 patches)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAlo6QfgWHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeobWfEADPEOdxWS7nW4Xkhug+7vLbcloJ
Om1VDKFD4n5NfB6e+vh8kQAnnnQ/LzFSv53giNdnjjE9IPNKxNhzBQFS95H189EP
ebP0mKOesTadkTx+MjFxenhnaTnzK0hkngdxz/frvaq+i6ECMhnq8Bw0elVG0nSg
X9ts0x6BuNIw6EjIdPP0GvfOV1DvUmdMz1YLJy4yoJ8Tm671y86y07jTJaaEN2Ex
9dp0j3hEqtrYZvgRdQ/hzYLFJ9fvcF1FyA+duufK3FNbkJtZn89zqseIuN2saqqw
QN/nDBrduzW6SR9y0JfXlatI6FN6316jskLcpqorz5/88KrwQeTOg2ZXXn0iw6l1
r0tkBP5/eEu2Dcd3WRsNtTnMZmGfc2uuqmvD1Pz0wN7RAzIYhPvs9to6TVy3mK5p
0OyFJaZU9vurtIPqVYjNtSofkUHKMOZZ1H7LocWaINGIVsREx/i0DmX//M0ZiqbB
mS4ybkn8yOYfFUizg51RYo2nKVyw/ZXNqBYBPUWio91CPj0vOUFitvfxnxNitV/m
182HVdoOnGhorYbS3J/8Su3AyEyhJTVWnmq0z3u1CuvdMhppbAzvXvmERotAJN9e
5Wp26PE2a7zb4LJhMBQ4q+RnUnZV5ADhigrIEGPOQoY5KdIrEOcW65wrgfhWYlER
NVJc2okVZKhg3o7amA==
=5JUK
-----END PGP SIGNATURE-----
Merge tag 'batadv-next-for-davem-20171220' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
This feature/cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
- de-inline hash functions to save memory footprint, by Denys Vlasenko
- Add License information to various files, by Sven Eckelmann (3 patches)
- Change batman_adv.h from ISC to MIT, by Sven Eckelmann
- Improve various includes, by Sven Eckelmann (5 patches)
- Lots of kernel-doc work by Sven Eckelmann (8 patches)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
With changes in inet_ files, SCTP state transitions are traced with
inet_sock_set_state tracepoint.
As SCTP state names, i.e. SCTP_SS_CLOSED, SCTP_SS_ESTABLISHED,
have the same value with TCP state names. So the output info still print
the TCP state names, that makes the code easy.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With changes in inet_ files, DCCP state transitions are traced with
inet_sock_set_state tracepoint.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk_state_load is only used by AF_INET/AF_INET6, so rename it to
inet_sk_state_load and move it into inet_sock.h.
sk_state_store is removed as it is not used any more.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As sk_state is a common field for struct sock, so the state
transition tracepoint should not be a TCP specific feature.
Currently it traces all AF_INET state transition, so I rename this
tracepoint to inet_sock_set_state tracepoint with some minor changes and move it
into trace/events/sock.h.
We dont need to create a file named trace/events/inet_sock.h for this one single
tracepoint.
Two helpers are introduced to trace sk_state transition
- void inet_sk_state_store(struct sock *sk, int newstate);
- void inet_sk_set_state(struct sock *sk, int state);
As trace header should not be included in other header files,
so they are defined in sock.c.
The protocol such as SCTP maybe compiled as a ko, hence export
inet_sk_set_state().
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If md is NULL, tun_dst must be freed, otherwise it will cause memory
leak.
Fixes: ef7baf5e08 ("ip6_gre: add ip6 erspan collect_md mode")
Cc: William Tu <u9012063@gmail.com>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If md is NULL, tun_dst must be freed, otherwise it will cause memory
leak.
Fixes: 1a66a836da ("gre: add collect_md mode to ERSPAN tunnel")
Cc: William Tu <u9012063@gmail.com>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Same as ipv4 code, when ip6erspan_rcv call return PACKET_REJECT, we
should call icmpv6_send to send icmp unreachable message in error path.
Fixes: 5a963eb61b ("ip6_gre: Add ERSPAN native tunnel support")
Acked-by: William Tu <u9012063@gmail.com>
Cc: William Tu <u9012063@gmail.com>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When erspan_rcv call return PACKET_REJECT, we shoudn't call ipgre_rcv to
process packets again, instead send icmp unreachable message in error
path.
Fixes: 84e54fe0a5 ("gre: introduce native tunnel support for ERSPAN")
Acked-by: William Tu <u9012063@gmail.com>
Cc: William Tu <u9012063@gmail.com>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
pskb_may_pull() can change skb->data, so we need to load ipv6h/ershdr at
the right place.
Fixes: 5a963eb61b ("ip6_gre: Add ERSPAN native tunnel support")
Cc: William Tu <u9012063@gmail.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
cls_bpf used to take care of tracking what offload state a filter
is in, i.e. it would track if offload request succeeded or not.
This information would then be used to issue correct requests to
the driver, e.g. requests for statistics only on offloaded filters,
removing only filters which were offloaded, using add instead of
replace if previous filter was not added etc.
This tracking of offload state no longer functions with the new
callback infrastructure. There could be multiple entities trying
to offload the same filter.
Throw out all the tracking and corresponding commands and simply
pass to the drivers both old and new bpf program. Drivers will
have to deal with offload state tracking by themselves.
Fixes: 3f7889c4c7 ("net: sched: cls_bpf: call block callbacks for offload")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use
%pM to print MAC
mac_pton() to convert it from ASCII to binary format, and
ether_addr_copy() to copy.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(I can trivially verify that that idr_remove in cleanup_net happens
after the network namespace count has dropped to zero --EWB)
Function get_net_ns_by_id() does not check for net::count
after it has found a peer in netns_ids idr.
It may dereference a peer, after its count has already been
finaly decremented. This leads to double free and memory
corruption:
put_net(peer) rtnl_lock()
atomic_dec_and_test(&peer->count) [count=0] ...
__put_net(peer) get_net_ns_by_id(net, id)
spin_lock(&cleanup_list_lock)
list_add(&net->cleanup_list, &cleanup_list)
spin_unlock(&cleanup_list_lock)
queue_work() peer = idr_find(&net->netns_ids, id)
| get_net(peer) [count=1]
| ...
| (use after final put)
v ...
cleanup_net() ...
spin_lock(&cleanup_list_lock) ...
list_replace_init(&cleanup_list, ..) ...
spin_unlock(&cleanup_list_lock) ...
... ...
... put_net(peer)
... atomic_dec_and_test(&peer->count) [count=0]
... spin_lock(&cleanup_list_lock)
... list_add(&net->cleanup_list, &cleanup_list)
... spin_unlock(&cleanup_list_lock)
... queue_work()
... rtnl_unlock()
rtnl_lock() ...
for_each_net(tmp) { ...
id = __peernet2id(tmp, peer) ...
spin_lock_irq(&tmp->nsid_lock) ...
idr_remove(&tmp->netns_ids, id) ...
... ...
net_drop_ns() ...
net_free(peer) ...
} ...
|
v
cleanup_net()
...
(Second free of peer)
Also, put_net() on the right cpu may reorder with left's cpu
list_replace_init(&cleanup_list, ..), and then cleanup_list
will be corrupted.
Since cleanup_net() is executed in worker thread, while
put_net(peer) can happen everywhere, there should be
enough time for concurrent get_net_ns_by_id() to pick
the peer up, and the race does not seem to be unlikely.
The patch fixes the problem in standard way.
(Also, there is possible problem in peernet2id_alloc(), which requires
check for net::count under nsid_lock and maybe_get_net(peer), but
in current stable kernel it's used under rtnl_lock() and it has to be
safe. Openswitch begun to use peernet2id_alloc(), and possibly it should
be fixed too. While this is not in stable kernel yet, so I'll send
a separate message to netdev@ later).
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Fixes: 0c7aecd4bd "netns: add rtnl cmd to add and get peer netns ids"
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
LTP/udp6_ipsec_vti tests fail when sending large UDP datagrams over
ip6_vti that require fragmentation and the underlying device has an
MTU smaller than 1500 plus some extra space for headers. This happens
because ip6_vti, by default, sets MTU to ETH_DATA_LEN and not updating
it depending on a destination address or link parameter. Further
attempts to send UDP packets may succeed because pmtu gets updated on
ICMPV6_PKT_TOOBIG in vti6_err().
In case the lower device has larger MTU size, e.g. 9000, ip6_vti works
but not using the possible maximum size, output packets have 1500 limit.
The above cases require manual MTU setup after ip6_vti creation. However
ip_vti already updates MTU based on lower device with ip_tunnel_bind_dev().
Here is the example when the lower device MTU is set to 9000:
# ip a sh ltp_ns_veth2
ltp_ns_veth2@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 ...
inet 10.0.0.2/24 scope global ltp_ns_veth2
inet6 fd00::2/64 scope global
# ip li add vti6 type vti6 local fd00::2 remote fd00::1
# ip li show vti6
vti6@NONE: <POINTOPOINT,NOARP> mtu 1500 ...
link/tunnel6 fd00::2 peer fd00::1
After the patch:
# ip li add vti6 type vti6 local fd00::2 remote fd00::1
# ip li show vti6
vti6@NONE: <POINTOPOINT,NOARP> mtu 8832 ...
link/tunnel6 fd00::2 peer fd00::1
Reported-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We support asynchronous crypto on layer 2 ESP now.
So no need to force synchronous crypto fallback on
offloading anymore.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
We now have support for asynchronous crypto operations in the layer 2 TX
path. This was the missing part to allow the GSO codepath for software
crypto, so allow this codepath now.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This patch implements asynchronous crypto callbacks
and a backlog handler that can be used when IPsec
is done at layer 2 in the TX path. It also extends
the skb validate functions so that we can update
the driver transmit return codes based on async
crypto operation or to indicate that we queued the
packet in a backlog queue.
Joint work with: Aviv Heller <avivh@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
We change the ESP GSO handlers to only segment the packets.
The ESP handling and encryption is defered to validate_xmit_xfrm()
where this is done for non GRO packets too. This makes the code
more robust and prepares for asynchronous crypto handling.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The recently added fib_metrics_match() causes a regression for routes
with both RTAX_FEATURES and RTAX_CC_ALGO if the latter has
TCP_CONG_NEEDS_ECN flag set:
| # ip link add d0 type dummy
| # ip link set d0 up
| # ip route add 172.29.29.0/24 dev d0 features ecn congctl dctcp
| # ip route del 172.29.29.0/24 dev d0 features ecn congctl dctcp
| RTNETLINK answers: No such process
During route insertion, fib_convert_metrics() detects that the given CC
algo requires ECN and hence sets DST_FEATURE_ECN_CA bit in
RTAX_FEATURES.
During route deletion though, fib_metrics_match() compares stored
RTAX_FEATURES value with that from userspace (which obviously has no
knowledge about DST_FEATURE_ECN_CA) and fails.
Fixes: 5f9ae3d9e7 ("ipv4: do metrics match when looking up and deleting a route")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
First, the check of &q->ring.queue against NULL is wrong, it
is always false. We should check the value rather than the address.
Secondly, we need the same check in pfifo_fast_reset() too,
as both ->reset() and ->destroy() are called in qdisc_destroy().
Fixes: c5ad119fb6 ("net: sched: pfifo_fast use skb_array")
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When, during a join operation, or during message transmission, a group
member needs to be added to the group's 'congested' list, we sort it
into the list in ascending order, according to its current advertised
window size. However, we miss the case when the member is already on
that list. This will have the result that the member, after the window
size has been decremented, might be at the wrong position in that list.
This again may have the effect that we during broadcast and multicast
transmissions miss the fact that a destination is not yet ready for
reception, and we end up sending anyway. From this point on, the
behavior during the remaining session is unpredictable, e.g., with
underflowing window sizes.
We now correct this bug by unconditionally removing the member from
the list before (re-)sorting it in.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2017-12-18
Here's the first bluetooth-next pull request for the 4.16 kernel.
- hci_ll: multiple cleanups & fixes
- Remove Gustavo Padovan from the MAINTAINERS file
- Support BLE Adversing while connected (if the controller can do it)
- DT updates for TI chips
- Various other smaller cleanups & fixes
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Now it's using IPV6_MIN_MTU as the min mtu in ip6_tnl_xmit, but
IPV6_MIN_MTU actually only works when the inner packet is ipv6.
With IPV6_MIN_MTU for ipv4 packets, the new pmtu for inner dst
couldn't be set less than 1280. It would cause tx_err and the
packet to be dropped when the outer dst pmtu is close to 1280.
Jianlin found it by running ipv4 traffic with the topo:
(client) gre6 <---> eth1 (route) eth2 <---> gre6 (server)
After changing eth2 mtu to 1300, the performance became very
low, or the connection was even broken. The issue also affects
ip4ip6 and ip6ip6 tunnels.
So if the inner packet is ipv4, 576 should be considered as the
min mtu.
Note that for ip4ip6 and ip6ip6 tunnels, the inner packet can
only be ipv4 or ipv6, but for gre6 tunnel, it may also be ARP.
This patch using 576 as the min mtu for non-ipv6 packet works
for all those cases.
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The same fix as the patch "ip_gre: remove the incorrect mtu limit for
ipgre tap" is also needed for ip6_gre.
Fixes: 61e84623ac ("net: centralize net_device min/max MTU checking")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipgre tap driver calls ether_setup(), after commit 61e84623ac
("net: centralize net_device min/max MTU checking"), the range
of mtu is [min_mtu, max_mtu], which is [68, 1500] by default.
It causes the dev mtu of the ipgre tap device to not be greater
than 1500, this limit value is not correct for ipgre tap device.
Besides, it's .change_mtu already does the right check. So this
patch is just to set max_mtu as 0, and leave the check to it's
.change_mtu.
Fixes: 61e84623ac ("net: centralize net_device min/max MTU checking")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hardware should not aggregate any packets when generic XDP is installed.
Cc: Ariel Elior <Ariel.Elior@cavium.com>
Cc: everest-linux-l2@cavium.com
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce NETIF_F_GRO_HW feature flag for NICs that support hardware
GRO. With this flag, we can now independently turn on or off hardware
GRO when GRO is on. Previously, drivers were using NETIF_F_GRO to
control hardware GRO and so it cannot be independently turned on or
off without affecting GRO.
Hardware GRO (just like GRO) guarantees that packets can be re-segmented
by TSO/GSO to reconstruct the original packet stream. Logically,
GRO_HW should depend on GRO since it a subset, but we will let
individual drivers enforce this dependency as they see fit.
Since NETIF_F_GRO is not propagated between upper and lower devices,
NETIF_F_GRO_HW should follow suit since it is a subset of GRO. In other
words, a lower device can independent have GRO/GRO_HW enabled or disabled
and no feature propagation is required. This will preserve the current
GRO behavior. This can be changed later if we decide to propagate GRO/
GRO_HW/RXCSUM from upper to lower devices.
Cc: Ariel Elior <Ariel.Elior@cavium.com>
Cc: everest-linux-l2@cavium.com
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In some case, we want to know how many sockets are in use in
different _net_ namespaces. It's a key resource metric.
This patch add a member in struct netns_core. This is a counter
for socket-inuse in the _net_ namespace. The patch will add/sub
counter in the sk_alloc, sk_clone_lock and __sk_free.
This patch will not counter the socket created in kernel.
It's not very useful for userspace to know how many kernel
sockets we created.
The main reasons for doing this are that:
1. When linux calls the 'do_exit' for process to exit, the functions
'exit_task_namespaces' and 'exit_task_work' will be called sequentially.
'exit_task_namespaces' may have destroyed the _net_ namespace, but
'sock_release' called in 'exit_task_work' may use the _net_ namespace
if we counter the socket-inuse in sock_release.
2. socket and sock are in pair. More important, sock holds the _net_
namespace. We counter the socket-inuse in sock, for avoiding holding
_net_ namespace again in socket. It's a easy way to maintain the code.
Signed-off-by: Martin Zhang <zhangjunweimartin@didichuxing.com>
Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the member name will make the code more readable.
This patch will be used in next patch.
Signed-off-by: Martin Zhang <zhangjunweimartin@didichuxing.com>
Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* hwsim:
- set To-DS bit in some frames missing it
- fix sleeping in atomic
* nl80211:
- doc cleanup
- fix locking in an error path
* build:
- don't append to created certs C files
- ship certificate pre-hexdumped
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAlo44hIACgkQB8qZga/f
l8RSchAAlk1jd9WTkvzQEUe3KpQ9cPoLe6SZ5ozPJCccJQ4GPlQiB9NK1hi+qfv/
WWdJl7zSWMujVtneeEhxHFnlwhRGh3s1t9IP1RACPEhlvFbyKob55cXJnbbiRtCJ
MEwW15c/SbCND79QRVXLQxV0JiNM0I8j1LIdEWoi2xFA+/g+zA7InZ1g/WvPUQq7
N28bGJ4lVEmBFJ1nnaF4kFpGRn73Cq2E9CqkWtS3Gfo9Gso6XCMvmPRjO2+BzJ67
Ovmy9jEJTZO9aGOUXtcaZbeNhlxJMqVdRv2mjpR/l9g6r2/u+BD4uiAlLy+RNcRU
gABqk3RJl+CA7BDl+YJahm33rLhokkZZIwq7UV5jsYOR1UTUdK/j7s5MpoqaSIVI
ZA5/emcEoY8XTt+MMCEJbhRIJA9rQXKOq6vdeqLK5YBL0HKlg5eIWtGqh3EcY6iO
+qe0cc4TFp4sWzOOrKShmDh5agDta1+gHUoBQsTBZnnU6pEybcY2FTUF+xN4u3mk
cEOOMFX8CiMDIyCf2GUALRbnwfZgWesxjfY7NXCAOL6T7JK+WFVX9Njn2JqT/eIO
FKUZl29nathswPTb1YdHmoAUJPU3TicuVnVPJzkJVG0lN5x7pwFX7/nkrZxphCBT
06t0r3RohfbMGwYTVXqi7z7FkDVr2YbP9fQ63r5yvWOgPCT+xLM=
=OWs9
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2017-12-19' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
A few more fixes:
* hwsim:
- set To-DS bit in some frames missing it
- fix sleeping in atomic
* nl80211:
- doc cleanup
- fix locking in an error path
* build:
- don't append to created certs C files
- ship certificate pre-hexdumped
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit enhances the scan results to report the per chain signal
strength based on the latest BSS update. This provides similar
information to what is already available through STA information.
Signed-off-by: Sunil Dutt <usdutt@qti.qualcomm.com>
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Send disconnection reason code to user space even if it's locally
generated, since some tests that check reason code may fail because of
the current behavior.
Signed-off-by: David Spinadel <david.spinadel@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This reverts commit e937b8da5a.
Turns out that a new driver (mt76) is coming in through
Kalle's tree, and will conflict with this. It also has some
conflicting requirements, so we'll revisit this later.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This reverts commit b0d52ad821.
We need to revert the TXQ scheduling API due to conflicts
with a new driver, and this depends on that API.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Not only does this remove the need for the hexdump code in most
normal kernel builds (still there for the extra directory), but
it also removes the need to ship binary files, which apparently
is somewhat problematic, as Randy reported.
While at it, also add the generated files to clean-files.
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently the certs C code generation appends to the generated files,
which is most likely a leftover from commit 715a123347 ("wireless:
don't write C files on failures"). This causes duplicate code in the
generated files if the certificates have their timestamps modified
between builds and thereby trigger the generation rules.
Fixes: 715a123347 ("wireless: don't write C files on failures")
Signed-off-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This is an old bugbear of mine:
https://www.mail-archive.com/netdev@vger.kernel.org/msg03894.html
By crafting special packets, it is possible to cause recursion
in our kernel when processing transport-mode packets at levels
that are only limited by packet size.
The easiest one is with DNAT, but an even worse one is where
UDP encapsulation is used in which case you just have to insert
an UDP encapsulation header in between each level of recursion.
This patch avoids this problem by reinjecting tranport-mode packets
through a tasklet.
Fixes: b05e106698 ("[IPV4/6]: Netfilter IPsec input hooks")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The function xdp_do_generic_redirect_map() is only used in this file, so
make it static.
Clean up sparse warning:
net/core/filter.c:2687:5: warning: no previous prototype
for 'xdp_do_generic_redirect_map' [-Wmissing-prototypes]
Signed-off-by: Xiongwei Song <sxwjean@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
pskb_may_pull() can change skb->data, so we need to re-load pkt_md
and ershdr at the right place.
Fixes: 94d7d8f292 ("ip6_gre: add erspan v2 support")
Fixes: f551c91de2 ("net: erspan: introduce erspan v2 for ip_gre")
Signed-off-by: William Tu <u9012063@gmail.com>
Cc: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If pskb_may_pull return failed, return PACKET_REJECT
instead of -ENOMEM.
Fixes: 94d7d8f292 ("ip6_gre: add erspan v2 support")
Fixes: f551c91de2 ("net: erspan: introduce erspan v2 for ip_gre")
Signed-off-by: William Tu <u9012063@gmail.com>
Cc: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Acked-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current HNCDSC handler takes the status flag from the AEN packet and
will update or change the current channel based on this flag and the
current channel status.
However the flag from the HNCDSC packet merely represents the host link
state. While the state of the host interface is potentially interesting
information it should not affect the state of the NCSI link. Indeed the
NCSI specification makes no mention of any recommended action related to
the host network controller driver state.
Update the HNCDSC handler to record the host network driver status but
take no other action.
Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Acked-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The early call to br_stp_change_bridge_id in bridge's newlink can cause
a memory leak if an error occurs during the newlink because the fdb
entries are not cleaned up if a different lladdr was specified, also
another minor issue is that it generates fdb notifications with
ifindex = 0. Another unrelated memory leak is the bridge sysfs entries
which get added on NETDEV_REGISTER event, but are not cleaned up in the
newlink error path. To remove this special case the call to
br_stp_change_bridge_id is done after netdev register and we cleanup the
bridge on changelink error via br_dev_delete to plug all leaks.
This patch makes netlink bridge destruction on newlink error the same as
dellink and ioctl del which is necessary since at that point we have a
fully initialized bridge device.
To reproduce the issue:
$ ip l add br0 address 00:11:22:33:44:55 type bridge group_fwd_mask 1
RTNETLINK answers: Invalid argument
$ rmmod bridge
[ 1822.142525] =============================================================================
[ 1822.143640] BUG bridge_fdb_cache (Tainted: G O ): Objects remaining in bridge_fdb_cache on __kmem_cache_shutdown()
[ 1822.144821] -----------------------------------------------------------------------------
[ 1822.145990] Disabling lock debugging due to kernel taint
[ 1822.146732] INFO: Slab 0x0000000092a844b2 objects=32 used=2 fp=0x00000000fef011b0 flags=0x1ffff8000000100
[ 1822.147700] CPU: 2 PID: 13584 Comm: rmmod Tainted: G B O 4.15.0-rc2+ #87
[ 1822.148578] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 1822.150008] Call Trace:
[ 1822.150510] dump_stack+0x78/0xa9
[ 1822.151156] slab_err+0xb1/0xd3
[ 1822.151834] ? __kmalloc+0x1bb/0x1ce
[ 1822.152546] __kmem_cache_shutdown+0x151/0x28b
[ 1822.153395] shutdown_cache+0x13/0x144
[ 1822.154126] kmem_cache_destroy+0x1c0/0x1fb
[ 1822.154669] SyS_delete_module+0x194/0x244
[ 1822.155199] ? trace_hardirqs_on_thunk+0x1a/0x1c
[ 1822.155773] entry_SYSCALL_64_fastpath+0x23/0x9a
[ 1822.156343] RIP: 0033:0x7f929bd38b17
[ 1822.156859] RSP: 002b:00007ffd160e9a98 EFLAGS: 00000202 ORIG_RAX: 00000000000000b0
[ 1822.157728] RAX: ffffffffffffffda RBX: 00005578316ba090 RCX: 00007f929bd38b17
[ 1822.158422] RDX: 00007f929bd9ec60 RSI: 0000000000000800 RDI: 00005578316ba0f0
[ 1822.159114] RBP: 0000000000000003 R08: 00007f929bff5f20 R09: 00007ffd160e8a11
[ 1822.159808] R10: 00007ffd160e9860 R11: 0000000000000202 R12: 00007ffd160e8a80
[ 1822.160513] R13: 0000000000000000 R14: 0000000000000000 R15: 00005578316ba090
[ 1822.161278] INFO: Object 0x000000007645de29 @offset=0
[ 1822.161666] INFO: Object 0x00000000d5df2ab5 @offset=128
Fixes: 30313a3d57 ("bridge: Handle IFLA_ADDRESS correctly when creating bridge device")
Fixes: 5b8d5429da ("bridge: netlink: register netdevice before executing changelink")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Whenever a new type of chunk is added, the corresp conversion in
sctp_cname should be added. Otherwise, in some places, pr_debug
will print it as "unknown chunk".
Fixes: cc16f00f65 ("sctp: add support for generating stream reconf ssn reset request chunk")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo R. Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now when reneging events in sctp_ulpq_renege(), the variable freed
could be increased by a __u16 value twice while freed is of __u16
type. It means freed may overflow at the second addition.
This patch is to fix it by using __u32 type for 'freed', while at
it, also to remove 'if (chunk)' check, as all renege commands are
generated in sctp_eat_data and it can't be NULL.
Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A group member going into state LEAVING should never go back to any
other state before it is finally deleted. However, this might happen
if the socket needs to send out a RECLAIM message during this interval.
Since we forget to remove the leaving member from the group's 'active'
or 'pending' list, the member might be selected for reclaiming, change
state to RECLAIMING, and get stuck in this state instead of being
deleted. This might lead to suppression of the expected 'member down'
event to the receiver.
We fix this by removing the member from all lists, except the RB tree,
at the moment it goes into state LEAVING.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Group messages are not supposed to be returned to sender when the
destination socket disappears. This is done correctly for regular
traffic messages, by setting the 'dest_droppable' bit in the header.
But we forget to do that in group protocol messages. This has the effect
that such messages may sometimes bounce back to the sender, be perceived
as a legitimate peer message, and wreak general havoc for the rest of
the session. In particular, we have seen that a member in state LEAVING
may go back to state RECLAIMED or REMITTED, hence causing suppression
of an otherwise expected 'member down' event to the user.
We fix this by setting the 'dest_droppable' bit even in group protocol
messages.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2017-12-17
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix a corner case in generic XDP where we have non-linear skbs
but enough tailroom in the skb to not miss to linearizing there,
from Song.
2) Fix BPF JIT bugs in s390x and ppc64 to not recache skb data when
BPF context is not skb, from Daniel.
3) Fix a BPF JIT bug in sparc64 where recaching skb data after helper
call would use the wrong register for the skb, from Daniel.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
One example of when an ICMPv6 packet is required to be looped back is
when a host acts as both a Multicast Listener and a Multicast Router.
A Multicast Router will listen on address ff02::16 for MLDv2 messages.
Currently, MLDv2 messages originating from a Multicast Listener running
on the same host as the Multicast Router are not being delivered to the
Multicast Router. This is due to dst.input being assigned the default
value of dst_discard.
This results in the packet being looped back but discarded before being
delivered to the Multicast Router.
This patch sets dst.input to ip6_input to ensure a looped back packet
is delivered to the Multicast Router.
Signed-off-by: Brendan McGrath <redmcg@redmandi.dyndns.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Three sets of overlapping changes, two in the packet scheduler
and one in the meson-gxl PHY driver.
Signed-off-by: David S. Miller <davem@davemloft.net>