This commit makes BBR use only the MSS (without any headers) to
calculate pacing rates when internal TCP-layer pacing is used.
This is necessary to achieve the correct pacing behavior in this case,
since tcp_internal_pacing() uses only the payload length to calculate
pacing delays.
Signed-off-by: Kevin Yang <yyd@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When receiving multiple packets with the same ts ecr value, only try
to compute rcv_rtt sample with the earliest received packet.
This is because the rcv_rtt calculated by later received packets
could possibly include long idle time or other types of delay.
For example:
(1) server sends last packet of reply with TS val V1
(2) client ACKs last packet of reply with TS ecr V1
(3) long idle time passes
(4) client sends next request data packet with TS ecr V1 (again!)
At this time, the rcv_rtt computed on server with TS ecr V1 will be
inflated with the idle time and should get ignored.
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This "feature" is unused, undocumented, and untested and so doesn't
really belong. A patch is under development to properly implement
support for detecting when a search gets diverted down a different
chain, which the common purpose of nulls markers.
This patch actually fixes a bug too. The table resizing allows a
table to grow to 2^31 buckets, but the hash is truncated to 27 bits -
any growth beyond 2^27 is wasteful an ineffective.
This patch results in NULLS_MARKER(0) being used for all chains,
and leaves the use of rht_is_a_null() to test for it.
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Due to the use of rhashtables in net namespaces,
rhashtable.h is included in lots of the kernel,
so a small changes can required a large recompilation.
This makes development painful.
This patch splits out rhashtable-types.h which just includes
the major type declarations, and does not include (non-trivial)
inline code. rhashtable.h is no longer included by anything
in the include/ directory.
Common include files only include rhashtable-types.h so a large
recompilation is only triggered when that changes.
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The dst_cid and src_cid are 64 bits, therefore 64 bit accessors should be
used, and in fact in virtio_transport_common.c only 64 bit accessors are
used. Using 32 bit accessors for 64 bit values breaks big endian systems.
This patch fixes a wrong use of le32_to_cpu in virtio_transport_send_pkt.
Fixes: b911682318 ("VSOCK: add loopback to virtio_transport")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If flower filter is created without the skip_sw flag, fl_mask_put()
can race with fl_classify() and we can destroy the mask rhashtable
while a lookup operation is accessing it.
BUG: unable to handle kernel paging request at 00000000000911d1
PGD 0 P4D 0
SMP PTI
CPU: 3 PID: 5582 Comm: vhost-5541 Not tainted 4.18.0-rc1.vanilla+ #1950
Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016
RIP: 0010:rht_bucket_nested+0x20/0x60
Code: 31 c8 c1 c1 18 29 c8 c3 66 90 8b 4f 04 ba 01 00 00 00 8b 07 48 8b bf 80 00 00 0
RSP: 0018:ffffafc5cfbb7a48 EFLAGS: 00010206
RAX: 0000000000001978 RBX: ffff9f12dff88a00 RCX: 00000000ffff9f12
RDX: 00000000000911d1 RSI: 0000000000000148 RDI: 0000000000000001
RBP: ffff9f12dff88a00 R08: 000000005f1cc119 R09: 00000000a715fae2
R10: ffffafc5cfbb7aa8 R11: ffff9f1cb4be804e R12: ffff9f1265e13000
R13: 0000000000000000 R14: ffffafc5cfbb7b48 R15: ffff9f12dff88b68
FS: 0000000000000000(0000) GS:ffff9f1d3f0c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000911d1 CR3: 0000001575a94006 CR4: 00000000001626e0
Call Trace:
fl_lookup+0x134/0x140 [cls_flower]
fl_classify+0xf3/0x180 [cls_flower]
tcf_classify+0x78/0x150
__netif_receive_skb_core+0x69e/0xa50
netif_receive_skb_internal+0x42/0xf0
tun_get_user+0xdd5/0xfd0 [tun]
tun_sendmsg+0x52/0x70 [tun]
handle_tx+0x2b3/0x5f0 [vhost_net]
vhost_worker+0xab/0x100 [vhost]
kthread+0xf8/0x130
ret_from_fork+0x35/0x40
Modules linked in: act_mirred act_gact cls_flower vhost_net vhost tap sch_ingress
CR2: 00000000000911d1
Fix the above waiting for a RCU grace period before destroying the
rhashtable: we need to use tcf_queue_work(), as rhashtable_destroy()
must run in process context, as pointed out by Cong Wang.
v1 -> v2: use tcf_queue_work to run rhashtable_destroy().
Fixes: 05cd271fd6 ("cls_flower: Support multiple masks per priority")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hightlights include:
Bugfixes:
- Fix an rcu deadlock in nfs_delegation_find_inode()
- Fix NFSv4 deadlocks due to not freeing the session slot in layoutget
- Don't send layoutreturn if the layout is already invalid
- Prevent duplicate XID allocation
- flexfiles: Don't tie up all the rpciod threads in resends
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbK9avAAoJEA4mA3inWBJclq0P/1VCigyDlsbtdby3z2leV84k
l0asrGjOQndljJ7I21awAgEo8KvXOd66cMv6YT3+UqEW18aNblH4/ngjyId6hPVb
RZDX7tsG16ZHEqfe9f9irNZo90mdvuSC4ChJ/CbbPesaK9pblE1d76b/qUVr4FUX
Gj7JPAC5ckoiZXPFfRWfc+o7JnGvs5wkEuDTy+ig6v7BRdL64hdPG3veRNpmLIAZ
uS/NCyRpO+nFN/ukmvuoI2ZQ3qfHubHBD+rHxr1UKT/ad7dywLmL2UBaYQ0Tl3bq
/iSQHutgJYj/80VaRTqdlLt/m4ebUZg+9BEZgM5MvqBWkXcpXND51zxExVJN4cGW
BOytqjLz0gP1OGb8w+Oow58K8l4XyEgHe2CtZ6Yz8Vwof7nchkpv7RSX50hJFIcA
YlikeDyDzfOmTT6ove5kF31WQSa3Bk6OMEei0of6hWU3UVHyEdr9az73pm/CLSHE
/R7w0osU3B9tmQD4btQeJ2DxP+syQwhelOYodyVTwOlkmmGg7DSV7fehnGyH8t8f
I4Yp8f0raiYGbwonYVE2+zDO140VRETEfTE4XQZnn41fZUfB74oIqk77JtgvGMk2
/+XFNCYBGadHdSBdxyJmhSjhoAWrhgChEIz1G12SiHrNvqIRY/uHhdCX1Ut5vlPf
5aqyn/yXm6rUH7aNh/Gd
=tz0M
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.18-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Hightlights include:
- fix an rcu deadlock in nfs_delegation_find_inode()
- fix NFSv4 deadlocks due to not freeing the session slot in
layoutget
- don't send layoutreturn if the layout is already invalid
- prevent duplicate XID allocation
- flexfiles: Don't tie up all the rpciod threads in resends"
* tag 'nfs-for-4.18-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
pNFS/flexfiles: Process writeback resends from nfsiod context as well
pNFS/flexfiles: Don't tie up all the rpciod threads in resends
sunrpc: Prevent duplicate XID allocation
pNFS: Don't send layoutreturn if the layout is already invalid
pNFS: Always free the session slot on error in nfs4_layoutget_handle_exception
NFS: Fix an rcu deadlock in nfs_delegation_find_inode()
Currently it is incrementing SctpFragUsrMsgs when the user message size
is of the exactly same size as the maximum fragment size, which is wrong.
The fix is to increment it only when user message is bigger than the
maximum fragment size.
Fixes: bfd2e4b873 ("sctp: refactor sctp_datamsg_from_user")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In function strp_data_ready(), it is useless to call queue_work if
the state of strparser is already paused. The state checking should
be done before calling queue_work. The change reduces the context
switches and improves the ktls-rx throughput by approx 20% (measured
on cortex-a53 based platform).
Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Acked-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use $(OBJDUMP) instead of literal 'objdump' to avoid
using host toolchain when cross compiling.
Fixes: 421780fd49 ("bpfilter: fix build error")
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Reported-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ipcm(6)_cookie field gso_size is set only in the udp path. The ip
layer copies this to cork only if sk_type is SOCK_DGRAM. This check
proved too permissive. Ping and l2tp sockets have the same type.
Limit to sockets of type SOCK_DGRAM and protocol IPPROTO_UDP to
exclude ping sockets.
v1 -> v2
- remove irrelevant whitespace changes
Fixes: bec1f6f697 ("udp: generate gso with UDP_SEGMENT")
Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/bpfilter/bpfilter_umh is a binary file generated when bpfilter is
enabled, add it to .gitignore to avoid committing it.
Fixes: d2ba09c17a ("net: add skeleton of bpfilter kernel module")
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bpfilter Makefile assumes that the system locale is en_US, and the
parsing of objdump output fails.
Set LC_ALL=C and, while at it, rewrite the objdump parsing so it spawns
only 2 processes instead of 7.
Fixes: d2ba09c17a ("net: add skeleton of bpfilter kernel module")
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
in the following script
# tc actions add action ife encode allow prio pass index 42
# tc actions replace action ife encode allow tcindex drop index 42
the action control should remain equal to 'pass', if the kernel failed
to replace the TC action. Pospone the assignment of the action control,
to ensure it is not overwritten in the error path of tcf_ife_init().
Fixes: ef6980b6be ("introduce IFE action")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
a recursive lock warning [1] can be observed with the following script,
# $TC actions add action ife encode allow prio pass index 42
IFE type 0xED3E
# $TC actions replace action ife encode allow tcindex pass index 42
in case the kernel was unable to run the last command (e.g. because of
the impossibility to load 'act_meta_skbtcindex'). For a similar reason,
the kernel can leak idr in the error path of tcf_ife_init(), because
tcf_idr_release() is not called after successful idr reservation:
# $TC actions add action ife encode allow tcindex index 47
IFE type 0xED3E
RTNETLINK answers: No such file or directory
We have an error talking to the kernel
# $TC actions add action ife encode allow tcindex index 47
IFE type 0xED3E
RTNETLINK answers: No space left on device
We have an error talking to the kernel
# $TC actions add action ife encode use mark 7 type 0xfefe pass index 47
IFE type 0xFEFE
RTNETLINK answers: No space left on device
We have an error talking to the kernel
Since tcfa_lock is already taken when the action is being edited, a call
to tcf_idr_release() wrongly makes tcf_idr_cleanup() take the same lock
again. On the other hand, tcf_idr_release() needs to be called in the
error path of tcf_ife_init(), to undo the last tcf_idr_create() invocation.
Fix both problems in tcf_ife_init().
Since the cleanup() routine can now be called when ife->params is NULL,
also add a NULL pointer check to avoid calling kfree_rcu(NULL, rcu).
[1]
============================================
WARNING: possible recursive locking detected
4.17.0-rc4.kasan+ #417 Tainted: G E
--------------------------------------------
tc/3932 is trying to acquire lock:
000000005097c9a6 (&(&p->tcfa_lock)->rlock){+...}, at: tcf_ife_cleanup+0x19/0x80 [act_ife]
but task is already holding lock:
000000005097c9a6 (&(&p->tcfa_lock)->rlock){+...}, at: tcf_ife_init+0xf6d/0x13c0 [act_ife]
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&(&p->tcfa_lock)->rlock);
lock(&(&p->tcfa_lock)->rlock);
*** DEADLOCK ***
May be due to missing lock nesting notation
2 locks held by tc/3932:
#0: 000000007ca8e990 (rtnl_mutex){+.+.}, at: tcf_ife_init+0xf61/0x13c0 [act_ife]
#1: 000000005097c9a6 (&(&p->tcfa_lock)->rlock){+...}, at: tcf_ife_init+0xf6d/0x13c0 [act_ife]
stack backtrace:
CPU: 3 PID: 3932 Comm: tc Tainted: G E 4.17.0-rc4.kasan+ #417
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
dump_stack+0x9a/0xeb
__lock_acquire+0xf43/0x34a0
? debug_check_no_locks_freed+0x2b0/0x2b0
? debug_check_no_locks_freed+0x2b0/0x2b0
? debug_check_no_locks_freed+0x2b0/0x2b0
? __mutex_lock+0x62f/0x1240
? kvm_sched_clock_read+0x1a/0x30
? sched_clock+0x5/0x10
? sched_clock_cpu+0x18/0x170
? find_held_lock+0x39/0x1d0
? lock_acquire+0x10b/0x330
lock_acquire+0x10b/0x330
? tcf_ife_cleanup+0x19/0x80 [act_ife]
_raw_spin_lock_bh+0x38/0x70
? tcf_ife_cleanup+0x19/0x80 [act_ife]
tcf_ife_cleanup+0x19/0x80 [act_ife]
__tcf_idr_release+0xff/0x350
tcf_ife_init+0xdde/0x13c0 [act_ife]
? ife_exit_net+0x290/0x290 [act_ife]
? __lock_is_held+0xb4/0x140
tcf_action_init_1+0x67b/0xad0
? tcf_action_dump_old+0xa0/0xa0
? sched_clock+0x5/0x10
? sched_clock_cpu+0x18/0x170
? kvm_sched_clock_read+0x1a/0x30
? sched_clock+0x5/0x10
? sched_clock_cpu+0x18/0x170
? memset+0x1f/0x40
tcf_action_init+0x30f/0x590
? tcf_action_init_1+0xad0/0xad0
? memset+0x1f/0x40
tc_ctl_action+0x48e/0x5e0
? mutex_lock_io_nested+0x1160/0x1160
? tca_action_gd+0x990/0x990
? sched_clock+0x5/0x10
? find_held_lock+0x39/0x1d0
rtnetlink_rcv_msg+0x4da/0x990
? validate_linkmsg+0x680/0x680
? sched_clock_cpu+0x18/0x170
? find_held_lock+0x39/0x1d0
netlink_rcv_skb+0x127/0x350
? validate_linkmsg+0x680/0x680
? netlink_ack+0x970/0x970
? __kmalloc_node_track_caller+0x304/0x3a0
netlink_unicast+0x40f/0x5d0
? netlink_attachskb+0x580/0x580
? _copy_from_iter_full+0x187/0x760
? import_iovec+0x90/0x390
netlink_sendmsg+0x67f/0xb50
? netlink_unicast+0x5d0/0x5d0
? copy_msghdr_from_user+0x206/0x340
? netlink_unicast+0x5d0/0x5d0
sock_sendmsg+0xb3/0xf0
___sys_sendmsg+0x60a/0x8b0
? copy_msghdr_from_user+0x340/0x340
? lock_downgrade+0x5e0/0x5e0
? tty_write_lock+0x18/0x50
? kvm_sched_clock_read+0x1a/0x30
? sched_clock+0x5/0x10
? sched_clock_cpu+0x18/0x170
? find_held_lock+0x39/0x1d0
? lock_downgrade+0x5e0/0x5e0
? lock_acquire+0x10b/0x330
? __audit_syscall_entry+0x316/0x690
? current_kernel_time64+0x6b/0xd0
? __fget_light+0x55/0x1f0
? __sys_sendmsg+0xd2/0x170
__sys_sendmsg+0xd2/0x170
? __ia32_sys_shutdown+0x70/0x70
? syscall_trace_enter+0x57a/0xd60
? rcu_read_lock_sched_held+0xdc/0x110
? __bpf_trace_sys_enter+0x10/0x10
? do_syscall_64+0x22/0x480
do_syscall_64+0xa5/0x480
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fd646988ba0
RSP: 002b:00007fffc9fab3c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007fffc9fab4f0 RCX: 00007fd646988ba0
RDX: 0000000000000000 RSI: 00007fffc9fab440 RDI: 0000000000000003
RBP: 000000005b28c8b3 R08: 0000000000000002 R09: 0000000000000000
R10: 00007fffc9faae20 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fffc9fab504 R14: 0000000000000001 R15: 000000000066c100
Fixes: 4e8c861550 ("net sched: net sched: ife action fix late binding")
Fixes: ef6980b6be ("introduce IFE action")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
if dev_get_valid_name failed, propagate its return code
and remove the setting err to ENODEV, it will be set to
0 again before dev_change_net_namespace exits.
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to 69678bcd4d ("udp: fix SO_BINDTODEVICE"), TCP socket lookups
need to fail if dev_match is not true. Currently, a packet to a given port
can match a socket bound to device when it should not. In the VRF case,
this causes the lookup to hit a VRF socket and not a global socket
resulting in a response trying to go through the VRF when it should not.
Fixes: 3fa6f616a7 ("net: ipv4: add second dif to inet socket lookups")
Fixes: 4297a0ef08 ("net: ipv6: add second dif to inet6 socket lookups")
Reported-by: Lou Berger <lberger@labn.net>
Diagnosed-by: Renato Westphal <renato@opensourcerouting.org>
Tested-by: Renato Westphal <renato@opensourcerouting.org>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This moves all of the netdev_printk(KERN_DEBUG, ...) messages over to
netdev_dbg.
As Joe explains:
> netdev_dbg is not included in object code unless
> DEBUG is defined or CONFIG_DYNAMIC_DEBUG is set.
> And then, it is not emitted into the log unless
> DEBUG is set or this specific netdev_dbg is enabled
> via the dynamic debug control file.
Which is what we're after in this case.
Acked-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This does not provide useful information. As the ncsi maintainer said:
> either we get a channel or broadcom has gone out to lunch
Acked-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
In normal operation we see this series of messages as the host drives
the network device:
ftgmac100 1e660000.ethernet eth0: NCSI: LSC AEN - channel 0 state down
ftgmac100 1e660000.ethernet eth0: NCSI: suspending channel 0
ftgmac100 1e660000.ethernet eth0: NCSI: configuring channel 0
ftgmac100 1e660000.ethernet eth0: NCSI: channel 0 link down after config
ftgmac100 1e660000.ethernet eth0: NCSI interface down
ftgmac100 1e660000.ethernet eth0: NCSI: LSC AEN - channel 0 state up
ftgmac100 1e660000.ethernet eth0: NCSI: configuring channel 0
ftgmac100 1e660000.ethernet eth0: NCSI interface up
ftgmac100 1e660000.ethernet eth0: NCSI: LSC AEN - channel 0 state down
ftgmac100 1e660000.ethernet eth0: NCSI: suspending channel 0
ftgmac100 1e660000.ethernet eth0: NCSI: configuring channel 0
ftgmac100 1e660000.ethernet eth0: NCSI: channel 0 link down after config
ftgmac100 1e660000.ethernet eth0: NCSI interface down
ftgmac100 1e660000.ethernet eth0: NCSI: LSC AEN - channel 0 state up
ftgmac100 1e660000.ethernet eth0: NCSI: configuring channel 0
ftgmac100 1e660000.ethernet eth0: NCSI interface up
This makes all of these messages netdev_dbg. They are still useful to
debug eg. misbehaving network device firmware, but we do not need them
filling up the kernel logs in normal operation.
Acked-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Krzysztof Kozlowski <krzk@kernel.org> reports that a heavy NFSv4
WRITE workload against a slow NFS server causes his Raspberry Pi
clients to stall. Krzysztof bisected it to commit 37ac86c3a7
("SUNRPC: Initialize rpc_rqst outside of xprt->reserve_lock") .
I was able to reproduce similar behavior and it appears that rarely
the RPC client layer is re-allocating an XID for an RPC that it has
already partially sent. This results in the client ignoring the
subsequent reply, which carries the original XID.
For various reasons, checking !req->rq_xmit_bytes_sent in
xprt_prepare_transmit is not a 100% reliable mechanism for
determining when a fresh XID is needed.
Trond's preference is to allocate the XID at the time each rpc_rqst
slot is initialized.
This patch should also address a gcc 4.1.2 complaint reported by
Geert Uytterhoeven <geert@linux-m68k.org>.
Reported-by: Krzysztof Kozlowski <krzk@kernel.org>
Fixes: 37ac86c3a7 ("SUNRPC: Initialize rpc_rqst outside of ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
When blackhole is used on top of classful qdisc like hfsc it breaks
qlen and backlog counters because packets are disappear without notice.
In HFSC non-zero qlen while all classes are inactive triggers warning:
WARNING: ... at net/sched/sch_hfsc.c:1393 hfsc_dequeue+0xba4/0xe90 [sch_hfsc]
and schedules watchdog work endlessly.
This patch return __NET_XMIT_BYPASS in addition to NET_XMIT_SUCCESS,
this flag tells upper layer: this packet is gone and isn't queued.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
ATM accounts for in-flight TX packets in sk_wmem_alloc of the VCC on
which they are to be sent. But it doesn't take ownership of those
packets from the sock (if any) which originally owned them. They should
remain owned by their actual sender until they've left the box.
There's a hack in pskb_expand_head() to avoid adjusting skb->truesize
for certain skbs, precisely to avoid messing up sk_wmem_alloc
accounting. Ideally that hack would cover the ATM use case too, but it
doesn't — skbs which aren't owned by any sock, for example PPP control
frames, still get their truesize adjusted when the low-level ATM driver
adds headroom.
This has always been an issue, it seems. The truesize of a packet
increases, and sk_wmem_alloc on the VCC goes negative. But this wasn't
for normal traffic, only for control frames. So I think we just got away
with it, and we probably needed to send 2GiB of LCP echo frames before
the misaccounting would ever have caused a problem and caused
atm_may_send() to start refusing packets.
Commit 14afee4b60 ("net: convert sock.sk_wmem_alloc from atomic_t to
refcount_t") did exactly what it was intended to do, and turned this
mostly-theoretical problem into a real one, causing PPPoATM to fail
immediately as sk_wmem_alloc underflows and atm_may_send() *immediately*
starts refusing to allow new packets.
The least intrusive solution to this problem is to stash the value of
skb->truesize that was accounted to the VCC, in a new member of the
ATM_SKB(skb) structure. Then in atm_pop_raw() subtract precisely that
value instead of the then-current value of skb->truesize.
Fixes: 158f323b98 ("net: adjust skb->truesize in pskb_expand_head()")
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Tested-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2018-06-16
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix a panic in devmap handling in generic XDP where return type
of __devmap_lookup_elem() got changed recently but generic XDP
code missed the related update, from Toshiaki.
2) Fix a freeze when BPF progs are loaded that include BPF to BPF
calls when JIT is enabled where we would later bail out via error
path w/o dropping kallsyms, and another one to silence syzkaller
splats from locking prog read-only, from Daniel.
3) Fix a bug in test_offloads.py BPF selftest which must not assume
that the underlying system have no BPF progs loaded prior to test,
and one in bpftool to fix accuracy of program load time, from Jakub.
4) Fix a bug in bpftool's probe for availability of the bpf(2)
BPF_TASK_FD_QUERY subcommand, from Yonghong.
5) Fix a regression in AF_XDP's XDP_SKB receive path where queue
id check got erroneously removed, from Björn.
6) Fix missing state cleanup in BPF's xfrm tunnel test, from William.
7) Check tunnel type more accurately in BPF's tunnel collect metadata
kselftest, from Jian.
8) Fix missing Kconfig fragments for BPF kselftests, from Anders.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Various netfilter fixlets from Pablo and the netfilter team.
2) Fix regression in IPVS caused by lack of PMTU exceptions on local
routes in ipv6, from Julian Anastasov.
3) Check pskb_trim_rcsum for failure in DSA, from Zhouyang Jia.
4) Don't crash on poll in TLS, from Daniel Borkmann.
5) Revert SO_REUSE{ADDR,PORT} change, it regresses various things
including Avahi mDNS. From Bart Van Assche.
6) Missing of_node_put in qcom/emac driver, from Yue Haibing.
7) We lack checking of the TCP checking in one special case during SYN
receive, from Frank van der Linden.
8) Fix module init error paths of mac80211 hwsim, from Johannes Berg.
9) Handle 802.1ad properly in stmmac driver, from Elad Nachman.
10) Must grab HW caps before doing quirk checks in stmmac driver, from
Jose Abreu.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (81 commits)
net: stmmac: Run HWIF Quirks after getting HW caps
neighbour: skip NTF_EXT_LEARNED entries during forced gc
net: cxgb3: add error handling for sysfs_create_group
tls: fix waitall behavior in tls_sw_recvmsg
tls: fix use-after-free in tls_push_record
l2tp: filter out non-PPP sessions in pppol2tp_tunnel_ioctl()
l2tp: reject creation of non-PPP sessions on L2TPv2 tunnels
mlxsw: spectrum_switchdev: Fix port_vlan refcounting
mlxsw: spectrum_router: Align with new route replace logic
mlxsw: spectrum_router: Allow appending to dev-only routes
ipv6: Only emit append events for appended routes
stmmac: added support for 802.1ad vlan stripping
cfg80211: fix rcu in cfg80211_unregister_wdev
mac80211: Move up init of TXQs
mac80211_hwsim: fix module init error paths
cfg80211: initialize sinfo in cfg80211_get_station
nl80211: fix some kernel doc tag mistakes
hv_netvsc: Fix the variable sizes in ipsecv2 and rsc offload
rds: avoid unenecessary cong_update in loop transport
l2tp: clean up stale tunnel or session in pppol2tp_connect's error path
...
Commit 67f29e07e1 ("bpf: devmap introduce dev_map_enqueue") changed
the return value type of __devmap_lookup_elem() from struct net_device *
to struct bpf_dtab_netdev * but forgot to modify generic XDP code
accordingly.
Thus generic XDP incorrectly used struct bpf_dtab_netdev where struct
net_device is expected, then skb->dev was set to invalid value.
v2:
- Fix compiler warning without CONFIG_BPF_SYSCALL.
Fixes: 67f29e07e1 ("bpf: devmap introduce dev_map_enqueue")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Commit 9ce33e4653 ("neighbour: support for NTF_EXT_LEARNED flag")
added support for NTF_EXT_LEARNED for neighbour entries.
NTF_EXT_LEARNED entries are neigh entries managed by control
plane (eg: Ethernet VPN implementation in FRR routing suite).
Periodic gc already excludes these entries. This patch extends
it to forced gc which the earlier patch missed.
Fixes: 9ce33e4653 ("neighbour: support for NTF_EXT_LEARNED flag")
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current behavior in tls_sw_recvmsg() is to wait for incoming tls
messages and copy up to exactly len bytes of data that the user
provided. This is problematic in the sense that i) if no packet
is currently queued in strparser we keep waiting until one has been
processed and pushed into tls receive layer for tls_wait_data() to
wake up and push the decrypted bits to user space. Given after
tls decryption, we're back at streaming data, use sock_rcvlowat()
hint from tcp socket instead. Retain current behavior with MSG_WAITALL
flag and otherwise use the hint target for breaking the loop and
returning to application. This is done if currently no ctx->recv_pkt
is ready, otherwise continue to process it from our strparser
backlog.
Fixes: c46234ebb4 ("tls: RX path for ktls")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzkaller managed to trigger a use-after-free in tls like the
following:
BUG: KASAN: use-after-free in tls_push_record.constprop.15+0x6a2/0x810 [tls]
Write of size 1 at addr ffff88037aa08000 by task a.out/2317
CPU: 3 PID: 2317 Comm: a.out Not tainted 4.17.0+ #144
Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET47W (1.21 ) 11/28/2016
Call Trace:
dump_stack+0x71/0xab
print_address_description+0x6a/0x280
kasan_report+0x258/0x380
? tls_push_record.constprop.15+0x6a2/0x810 [tls]
tls_push_record.constprop.15+0x6a2/0x810 [tls]
tls_sw_push_pending_record+0x2e/0x40 [tls]
tls_sk_proto_close+0x3fe/0x710 [tls]
? tcp_check_oom+0x4c0/0x4c0
? tls_write_space+0x260/0x260 [tls]
? kmem_cache_free+0x88/0x1f0
inet_release+0xd6/0x1b0
__sock_release+0xc0/0x240
sock_close+0x11/0x20
__fput+0x22d/0x660
task_work_run+0x114/0x1a0
do_exit+0x71a/0x2780
? mm_update_next_owner+0x650/0x650
? handle_mm_fault+0x2f5/0x5f0
? __do_page_fault+0x44f/0xa50
? mm_fault_error+0x2d0/0x2d0
do_group_exit+0xde/0x300
__x64_sys_exit_group+0x3a/0x50
do_syscall_64+0x9a/0x300
? page_fault+0x8/0x30
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happened through fault injection where aead_req allocation in
tls_do_encryption() eventually failed and we returned -ENOMEM from
the function. Turns out that the use-after-free is triggered from
tls_sw_sendmsg() in the second tls_push_record(). The error then
triggers a jump to waiting for memory in sk_stream_wait_memory()
resp. returning immediately in case of MSG_DONTWAIT. What follows is
the trim_both_sgl(sk, orig_size), which drops elements from the sg
list added via tls_sw_sendmsg(). Now the use-after-free gets triggered
when the socket is being closed, where tls_sk_proto_close() callback
is invoked. The tls_complete_pending_work() will figure that there's
a pending closed tls record to be flushed and thus calls into the
tls_push_pending_closed_record() from there. ctx->push_pending_record()
is called from the latter, which is the tls_sw_push_pending_record()
from sw path. This again calls into tls_push_record(). And here the
tls_fill_prepend() will panic since the buffer address has been freed
earlier via trim_both_sgl(). One way to fix it is to move the aead
request allocation out of tls_do_encryption() early into tls_push_record().
This means we don't prep the tls header and advance state to the
TLS_PENDING_CLOSED_RECORD before allocation which could potentially
fail happened. That fixes the issue on my side.
Fixes: 3c4d755915 ("tls: kernel TLS support")
Reported-by: syzbot+5c74af81c547738e1684@syzkaller.appspotmail.com
Reported-by: syzbot+709f2810a6a05f11d4d3@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
pppol2tp_tunnel_ioctl() can act on an L2TPv3 tunnel, in which case
'session' may be an Ethernet pseudo-wire.
However, pppol2tp_session_ioctl() expects a PPP pseudo-wire, as it
assumes l2tp_session_priv() points to a pppol2tp_session structure. For
an Ethernet pseudo-wire l2tp_session_priv() points to an l2tp_eth_sess
structure instead, making pppol2tp_session_ioctl() access invalid
memory.
Fixes: d9e31d17ce ("l2tp: Add L2TP ethernet pseudowire support")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
The /proc/net/pppol2tp handlers (pppol2tp_seq_*()) iterate over all
L2TPv2 tunnels, and rightfully expect that only PPP sessions can be
found there. However, l2tp_netlink accepts creating Ethernet sessions
regardless of the underlying tunnel version.
This confuses pppol2tp_seq_session_show(), which expects that
l2tp_session_priv() returns a pppol2tp_session structure. When the
session is an Ethernet pseudo-wire, a struct l2tp_eth_sess is returned
instead. This leads to invalid memory access when
pppol2tp_session_get_sock() later tries to dereference ps->sk.
Fixes: d9e31d17ce ("l2tp: Add L2TP ethernet pseudowire support")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current code will emit an append event in the FIB notification chain for
any route added with NLM_F_APPEND set, even if the route was not
appended to any existing route.
This is inconsistent with IPv4 where such an event is only emitted when
the new route is appended after an existing one.
Align IPv6 behavior with IPv4, thereby allowing listeners to more easily
handle these events.
Fixes: f34436a430 ("net/ipv6: Simplify route replace and appending into multipath route")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* missing RCU grace period enforcement led to drivers freeing
data structures before; fix from Dedy Lansky.
* hwsim module init error paths were messed up; fixed it myself
after a report from Colin King (who had sent a partial patch)
* kernel-doc tag errors; fix from Luca Coelho
* initialize the on-stack sinfo data structure when getting
station information; fix from Sven Eckelmann
* TXQ state dumping is now done from init, and when TXQs aren't
initialized yet at that point, bad things happen, move the
initialization; fix from Toke Høiland-Jørgensen.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAlsjswEACgkQB8qZga/f
l8T7pg/8CkKKwJRR/0ZIrYOyku/yEuZHpZ071/9gFhUna86zEymC/NDDuu27frny
G4TYJuQl/x5fyTLMteFkLd4oqiBP43X5Mh/UZtTG5jZVfU2W5PQSALSHn8SEQID6
u2hQBY1vM82CQbgUH6UjlIps40Y92urP5OG6LhT9F/9DijPYgrkxoNote+4x5wgM
gdL3yvXrmsExdgRFjuHTUXX8pT2J01UyxJAmuF0k5jMQzj6JGnwrLiSjUe4/a1qn
qJC72GYAezXVjv8noIW+B+e5jwABj4AbL9lJbU8lxHymTykTVqLlJbtMWygFoF2Y
rw8Gc2+eqafDnHzlgO0GXls2X89cbpmxBd3rdxq6J16oiAyNDmILw+w5Ap85bzck
1lQ3OvzOi25+qTndHBuV4wK066UbYJVCybgN25RRbM0q+soTHRZrpk8wFC27fGW8
Qt1BmyLBMvTDBkjBQ983QxLOJ8o0iNMuyKPxnULmveBRuhGfrDVOf8w7ueDSGJ61
vlTBA3gUBCHrtl7L67Tr6iun4y8n8YWhOz6VljaN6L5FFg5fVX15d9me6hXmAtar
ETbQFTPNWpFxYRElaIe+RFdtLkBFbE9rGLUZ1gQZ+WhH5YT+hDSqLm9WOJ61PbTv
VmNKrelwjdXL06Ei6O6ptRO7hoy+N0Z2khqAOpvNwaE30NojGqQ=
=siAL
-----END PGP SIGNATURE-----
Merge tag 'mac80211-for-davem-2018-06-15' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
A handful of fixes:
* missing RCU grace period enforcement led to drivers freeing
data structures before; fix from Dedy Lansky.
* hwsim module init error paths were messed up; fixed it myself
after a report from Colin King (who had sent a partial patch)
* kernel-doc tag errors; fix from Luca Coelho
* initialize the on-stack sinfo data structure when getting
station information; fix from Sven Eckelmann
* TXQ state dumping is now done from init, and when TXQs aren't
initialized yet at that point, bad things happen, move the
initialization; fix from Toke Høiland-Jørgensen.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Callers of cfg80211_unregister_wdev can free the wdev object
immediately after this function returns. This may crash the kernel
because this wdev object is still in use by other threads.
Add synchronize_rcu() after list_del_rcu to make sure wdev object can
be safely freed.
Signed-off-by: Dedy Lansky <dlansky@codeaurora.org>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
On init, ieee80211_if_add() dumps the interface. Since that now includes a
dump of the TXQ state, we need to initialise that before the dump happens.
So move up the TXQ initialisation to to before the call to
ieee80211_if_add().
Fixes: 52539ca89f ("cfg80211: Expose TXQ stats and parameters to userspace")
Reported-by: Niklas Cassel <niklas.cassel@linaro.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Tested-by: Niklas Cassel <niklas.cassel@linaro.org>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Most of the implementations behind cfg80211_get_station will not initialize
sinfo to zero before manipulating it. For example, the member "filled",
which indicates the filled in parts of this struct, is often only modified
by enabling certain bits in the bitfield while keeping the remaining bits
in their original state. A caller without a preinitialized sinfo.filled can
then no longer decide which parts of sinfo were filled in by
cfg80211_get_station (or actually the underlying implementations).
cfg80211_get_station must therefore take care that sinfo is initialized to
zero. Otherwise, the caller may tries to read information which was not
filled in and which must therefore also be considered uninitialized. In
batadv_v_elp_get_throughput's case, an invalid "random" expected throughput
may be stored for this neighbor and thus the B.A.T.M.A.N V algorithm may
switch to non-optimal neighbors for certain destinations.
Fixes: 7406353d43 ("cfg80211: implement cfg80211_get_station cfg80211 API")
Reported-by: Thomas Lauer <holminateur@gmail.com>
Reported-by: Marcel Schmidt <ff.z-casparistrasse@mailbox.org>
Cc: b.a.t.m.a.n@lists.open-mesh.org
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Loop transport which is self loopback, remote port congestion
update isn't relevant. Infact the xmit path already ignores it.
Receive path needs to do the same.
Reported-by: syzbot+4c20b3866171ce8441d2@syzkaller.appspotmail.com
Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
pppol2tp_connect() may create a tunnel or a session. Remove them in
case of error.
Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
If 'fd' is negative, l2tp_tunnel_create() creates a tunnel socket using
the configuration passed in 'tcfg'. Currently, pppol2tp_connect() sets
the relevant fields to zero, tricking l2tp_tunnel_create() into setting
up an unusable kernel socket.
We can't set 'tcfg' with the required fields because there's no way to
get them from the current connect() parameters. So let's restrict
kernel sockets creation to the netlink API, which is the original use
case.
Fixes: 789a4a2c61 ("l2tp: Add support for static unmanaged L2TPv3 tunnels")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
l2tp_session_priv() returns a struct pppol2tp_session pointer only for
PPPoL2TP sessions. In particular, if the session is an L2TP_PWTYPE_ETH
pseudo-wire, l2tp_session_priv() returns a pointer to an l2tp_eth_sess
structure, which is much smaller than struct pppol2tp_session. This
leads to invalid memory dereference when trying to lock ps->sk_lock.
Fixes: d9e31d17ce ("l2tp: Add L2TP ethernet pseudowire support")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Define cfg.pw_type so that the new session is created with its .pwtype
field properly set (L2TP_PWTYPE_PPP).
Not setting the pseudo-wire type had several annoying effects:
* Invalid value returned in the L2TP_ATTR_PW_TYPE attribute when
dumping sessions with the netlink API.
* Impossibility to delete the session using the netlink API (because
l2tp_nl_cmd_session_delete() gets the deletion callback function
from an array indexed by the session's pseudo-wire type).
Also, there are several cases where we should check a session's
pseudo-wire type. For example, pppol2tp_connect() should refuse to
connect a session that is not PPPoL2TP, but that requires the session's
.pwtype field to be properly set.
Fixes: f7faffa3ff ("l2tp: Add L2TPv3 protocol support")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 079096f103 ("tcp/dccp: install syn_recv requests into ehash
table") introduced an optimization for the handling of child sockets
created for a new TCP connection.
But this optimization passes any data associated with the last ACK of the
connection handshake up the stack without verifying its checksum, because it
calls tcp_child_process(), which in turn calls tcp_rcv_state_process()
directly. These lower-level processing functions do not do any checksum
verification.
Insert a tcp_checksum_complete call in the TCP_NEW_SYN_RECEIVE path to
fix this.
Fixes: 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
requests are aborted, improving CephFS ENOSPC handling and making
"umount -f" actually work (Zheng and myself). The rest is mostly
mount option handling cleanups from Chengguang and assorted fixes
from Zheng, Luis and Dongsheng.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJbIkigAAoJEEp/3jgCEfOL3EUH/1s7Ib3FgFzG/SPPKISxZOGr
ndZGg0rPT9mPIQ4rp6t0z/cDlMrluPmCK3sWrAPe//sZz9iZiuip+mCL0gUFXFNr
1kL2xDKkJzGxtP3UlUvr5CC6bnxLdeBXJRBDLk/swtphuqArKndlbN/iLZnCZivT
uJDk+vZTwNJ3UhQP4QdnOQLV60NYs+q4euTqbZF3+pDiRiONbxRfXC3adFsc8zL9
zlie3CHPbrQHWMsfNvbfM3rBH1WhTwEssDm+IEFlKl19q9SKP2WPZfmBcE1pmZ58
AhIMoNGdQha1FXS6N96kaPaqFgeysPnEPoyHDqLxsUMKqsvJlOEZsK1jujza4rE=
=EfXm
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.18-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"The main piece is a set of libceph changes that revamps how OSD
requests are aborted, improving CephFS ENOSPC handling and making
"umount -f" actually work (Zheng and myself).
The rest is mostly mount option handling cleanups from Chengguang and
assorted fixes from Zheng, Luis and Dongsheng.
* tag 'ceph-for-4.18-rc1' of git://github.com/ceph/ceph-client: (31 commits)
rbd: flush rbd_dev->watch_dwork after watch is unregistered
ceph: update description of some mount options
ceph: show ino32 if the value is different with default
ceph: strengthen rsize/wsize/readdir_max_bytes validation
ceph: fix alignment of rasize
ceph: fix use-after-free in ceph_statfs()
ceph: prevent i_version from going back
ceph: fix wrong check for the case of updating link count
libceph: allocate the locator string with GFP_NOFAIL
libceph: make abort_on_full a per-osdc setting
libceph: don't abort reads in ceph_osdc_abort_on_full()
libceph: avoid a use-after-free during map check
libceph: don't warn if req->r_abort_on_full is set
libceph: use for_each_request() in ceph_osdc_abort_on_full()
libceph: defer __complete_request() to a workqueue
libceph: move more code into __complete_request()
libceph: no need to call flush_workqueue() before destruction
ceph: flush pending works before shutdown super
ceph: abort osd requests on force umount
libceph: introduce ceph_osdc_abort_requests()
...
Now sctp GSO uses skb_gro_receive() to append the data into head
skb frag_list. However it actually only needs very few code from
skb_gro_receive(). Besides, NAPI_GRO_CB has to be set while most
of its members are not needed here.
This patch is to add sctp_packet_gso_append() to build GSO frames
instead of skb_gro_receive(), and it would avoid many unnecessary
checks and make the code clearer.
Note that sctp will use page frags instead of frag_list to build
GSO frames in another patch. But it may take time, as sctp's GSO
frames may have different size. skb_segment() can only split it
into the frags with the same size, which would break the border
of sctp chunks.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter patches for your net tree:
1) Fix NULL pointer dereference from nf_nat_decode_session() if NAT is
not loaded, from Prashant Bhole.
2) Fix socket extension module autoload.
3) Don't bogusly reject sets with the NFT_SET_EVAL flag set on from
the dynset extension.
4) Fix races with nf_tables module removal and netns exit path,
patches from Florian Westphal.
5) Don't hit BUG_ON if jumpstack goes too deep, instead hit
WARN_ON_ONCE, from Taehee Yoo.
6) Another NULL pointer dereference from ctnetlink, again if NAT is
not loaded, from Florian Westphal.
7) Fix x_tables match list corruption in xt_connmark module removal
path, also from Florian.
8) nf_conncount doesn't properly deal with conntrack zones, hence
garbage collector may get rid of entries in a different zone.
From Yi-Hung Wei.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
- Additional struct_size() conversions (Matthew, Kees)
- Explicitly reported overflow fixes (Silvio, Kees)
- Add missing kvcalloc() function (Kees)
- Treewide conversions of allocators to use either 2-factor argument
variant when available, or array_size() and array3_size() as needed (Kees)
-----BEGIN PGP SIGNATURE-----
Comment: Kees Cook <kees@outflux.net>
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlsgVtMWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJhsJEACLYe2EbwLFJz7emOT1KUGK5R1b
oVxJog0893WyMqgk9XBlA2lvTBRBYzR3tzsadfYo87L3VOBzazUv0YZaweJb65sF
bAvxW3nY06brhKKwTRed1PrMa1iG9R63WISnNAuZAq7+79mN6YgW4G6YSAEF9lW7
oPJoPw93YxcI8JcG+dA8BC9w7pJFKooZH4gvLUSUNl5XKr8Ru5YnWcV8F+8M4vZI
EJtXFmdlmxAledUPxTSCIojO8m/tNOjYTreBJt9K1DXKY6UcgAdhk75TRLEsp38P
fPvMigYQpBDnYz2pi9ourTgvZLkffK1OBZ46PPt8BgUZVf70D6CBg10vK47KO6N2
zreloxkMTrz5XohyjfNjYFRkyyuwV2sSVrRJqF4dpyJ4NJQRjvyywxIP4Myifwlb
ONipCM1EjvQjaEUbdcqKgvlooMdhcyxfshqJWjHzXB6BL22uPzq5jHXXugz8/ol8
tOSM2FuJ2sBLQso+szhisxtMd11PihzIZK9BfxEG3du+/hlI+2XgN7hnmlXuA2k3
BUW6BSDhab41HNd6pp50bDJnL0uKPWyFC6hqSNZw+GOIb46jfFcQqnCB3VZGCwj3
LH53Be1XlUrttc/NrtkvVhm4bdxtfsp4F7nsPFNDuHvYNkalAVoC3An0BzOibtkh
AtfvEeaPHaOyD8/h2Q==
=zUUp
-----END PGP SIGNATURE-----
Merge tag 'overflow-v4.18-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull more overflow updates from Kees Cook:
"The rest of the overflow changes for v4.18-rc1.
This includes the explicit overflow fixes from Silvio, further
struct_size() conversions from Matthew, and a bug fix from Dan.
But the bulk of it is the treewide conversions to use either the
2-factor argument allocators (e.g. kmalloc(a * b, ...) into
kmalloc_array(a, b, ...) or the array_size() macros (e.g. vmalloc(a *
b) into vmalloc(array_size(a, b)).
Coccinelle was fighting me on several fronts, so I've done a bunch of
manual whitespace updates in the patches as well.
Summary:
- Error path bug fix for overflow tests (Dan)
- Additional struct_size() conversions (Matthew, Kees)
- Explicitly reported overflow fixes (Silvio, Kees)
- Add missing kvcalloc() function (Kees)
- Treewide conversions of allocators to use either 2-factor argument
variant when available, or array_size() and array3_size() as needed
(Kees)"
* tag 'overflow-v4.18-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (26 commits)
treewide: Use array_size in f2fs_kvzalloc()
treewide: Use array_size() in f2fs_kzalloc()
treewide: Use array_size() in f2fs_kmalloc()
treewide: Use array_size() in sock_kmalloc()
treewide: Use array_size() in kvzalloc_node()
treewide: Use array_size() in vzalloc_node()
treewide: Use array_size() in vzalloc()
treewide: Use array_size() in vmalloc()
treewide: devm_kzalloc() -> devm_kcalloc()
treewide: devm_kmalloc() -> devm_kmalloc_array()
treewide: kvzalloc() -> kvcalloc()
treewide: kvmalloc() -> kvmalloc_array()
treewide: kzalloc_node() -> kcalloc_node()
treewide: kzalloc() -> kcalloc()
treewide: kmalloc() -> kmalloc_array()
mm: Introduce kvcalloc()
video: uvesafb: Fix integer overflow in allocation
UBIFS: Fix potential integer overflow in allocation
leds: Use struct_size() in allocation
Convert intel uncore to struct_size
...
smc->clcsock is an internal TCP socket, after TCP socket
converts to ->poll_mask, ->poll doesn't exist any more.
So just convert smc socket to ->poll_mask too.
Fixes: 2c7d3daceb ("net/tcp: convert to ->poll_mask")
Reported-by: syzbot+f5066e369b2d5fff630f@syzkaller.appspotmail.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Revert the patch mentioned in the subject because it breaks at least
the Avahi mDNS daemon. That patch namely causes the Ubuntu 18.04 Avahi
daemon to fail to start:
Jun 12 09:49:24 ubuntu-vm avahi-daemon[529]: Successfully called chroot().
Jun 12 09:49:24 ubuntu-vm avahi-daemon[529]: Successfully dropped remaining capabilities.
Jun 12 09:49:24 ubuntu-vm avahi-daemon[529]: No service file found in /etc/avahi/services.
Jun 12 09:49:24 ubuntu-vm avahi-daemon[529]: SO_REUSEADDR failed: Structure needs cleaning
Jun 12 09:49:24 ubuntu-vm avahi-daemon[529]: SO_REUSEADDR failed: Structure needs cleaning
Jun 12 09:49:24 ubuntu-vm avahi-daemon[529]: Failed to create server: No suitable network protocol available
Jun 12 09:49:24 ubuntu-vm avahi-daemon[529]: avahi-daemon 0.7 exiting.
Jun 12 09:49:24 ubuntu-vm systemd[1]: avahi-daemon.service: Main process exited, code=exited, status=255/n/a
Jun 12 09:49:24 ubuntu-vm systemd[1]: avahi-daemon.service: Failed with result 'exit-code'.
Jun 12 09:49:24 ubuntu-vm systemd[1]: Failed to start Avahi mDNS/DNS-SD Stack.
Fixes: f396922d86 ("net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets")
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, we use check_hlist() for garbage colleciton. However, we
use the ‘zone’ from the counted entry to query the existence of
existing entries in the hlist. This could be wrong when they are in
different zones, and this patch fixes this issue.
Fixes: e59ea3df3f ("netfilter: xt_connlimit: honor conntrack zone if available")
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This needs to use xt_unregister_targets, else new revision is left
on the list which then causes list to point to a target struct that has been free'd.
Fixes: 472a73e007 ("netfilter: xt_conntrack: Support bit-shifting for CONNMARK & MARK targets.")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Dan Carpenter points out that deref occurs after NULL check, we should
re-fetch the pointer and check that instead.
Fixes: 2c205dd398 ("netfilter: add struct nf_nat_hook and use it")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When depth of chain is bigger than NFT_JUMP_STACK_SIZE, the nft_do_chain
crashes. But there is no need to crash hard here.
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If net namespace is exiting while nf_tables module is being removed
we can oops:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
IP: nf_tables_flowtable_event+0x43/0xf0 [nf_tables]
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
Modules linked in: nf_tables(-) nfnetlink [..]
unregister_netdevice_notifier+0xdd/0x130
nf_tables_module_exit+0x24/0x3a [nf_tables]
SyS_delete_module+0x1c5/0x240
do_syscall_64+0x74/0x190
Avoid this by attempting to take reference on the net namespace from
the notifiers. If it fails the namespace is exiting already, and nft
core is taking care of cleanup work.
We also need to make sure the netdev hook type gets removed
before netns ops removal, else notifier might be invoked with device
event for a netns where net->nft was never initialised (because
pernet ops was removed beforehand).
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We must first remove the nfnetlink protocol handler when nf_tables module
is unloaded -- we don't want userspace to submit new change requests once
we've started to tear down nft state.
Furthermore, nfnetlink must not call any subsystem function after
call_batch returned -EAGAIN.
EAGAIN means the subsys mutex was dropped, so its unlikely but possible that
nf_tables subsystem was removed due to 'rmmod nf_tables' on another cpu.
Therefore, we must abort batch completely and not move on to next part of
the batch.
Last, we can't invoke ->abort unless we've checked that the subsystem is
still registered.
Change netns exit path of nf_tables to make sure any incompleted
transaction gets removed on exit.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
NFT_SET_EVAL is signalling the kernel that this sets can be updated from
the evaluation path, even if there are no expressions attached to the
element. Otherwise, set updates with no expressions fail. Update
description to describe the right semantics.
Fixes: 22fe54d5fe ("netfilter: nf_tables: add support for dynamic set updates")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add alias definition for module autoload when adding socket rules.
Fixes: 554ced0a6e ("netfilter: nf_tables: add support for native socket matching")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Highlights include:
Stable fixes:
- Fix a 1-byte stack overflow in nfs_idmap_read_and_verify_message
- Fix a hang due to incorrect error returns in rpcrdma_convert_iovs()
- Revert an incorrect change to the NFSv4.1 callback channel
- Fix a bug in the NFSv4.1 sequence error handling
Features and optimisations:
- Support for piggybacking a LAYOUTGET operation to the OPEN compound
- RDMA performance enhancements to deal with transport congestion
- Add proper SPDX tags for NetApp-contributed RDMA source
- Do not request delegated file attributes (size+change) from the server
- Optimise away a GETATTR in the lookup revalidate code when doing NFSv4 OPEN
- Optimise away unnecessary lookups for rename targets
- Misc performance improvements when freeing NFSv4 delegations
Bugfixes and cleanups:
- Try to fail quickly if proto=rdma
- Clean up RDMA receive trace points
- Fix sillyrename to return the delegation when appropriate
- Misc attribute revalidation fixes
- Immediately clear the pNFS layout on a file when the server returns ESTALE
- Return NFS4ERR_DELAY when delegation/layout recalls fail due to igrab()
- Fix the client behaviour on NFS4ERR_SEQ_FALSE_RETRY
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbH8gIAAoJEA4mA3inWBJcpzYQAJYY3ykt9oLQgm/2b/D/weDe
6890M9W5nIeuZq5soWSpYsZTxqIFbGV4laG/eCTW1gUN1TitSZsoOp7kqhRHXOjq
Rv3ZvjlZsP2qv2SnzsEmhJsynfyB46d19smSTJhgQ8dnXhaZv04Wsd4krLHx0z6p
uUUis5Q1m+vL7HsFPp3iUareO/DFKeSkw2cQ2V5ksTIEiAzX7GC+Ex/KKWf82nrJ
hm7+Nq7rLf1QHJkQvsc3fYCMR4gIzEwUu6F8RyxCoAVgD6O90Hx6NbxnINaHDD4N
U0nRP5LwCyN9hbPWvwcH7Sn4ePDTos2yj2tFO5NP9btTLDVLFSGYZ2a74d9PRdAf
9jn6f6juSDwI7T6NXvkHzzkJG6Or9ABAUZo+yX5JoD6lmgOcPUJpLRy6fu7UxAuN
a5OZ7d9edYpOi0Kys8sDSIlLlxZtFkvybOMVuI3dSHsI+c0g39w8oarpqT2wXWMs
/ZtFz0FCreHhKkNtz7Z49z1UQHDv/XYM0WkcO+eaeK58RLIEE0pZHoMvPKP63lkI
nbbgHvBRAu38Jtvvu65Hpb/VpBcqNGM5hjN1cfW/BOqAPKW23s4vWVj+/1silfW/
uw0MkNrDC9endoALp/YMCcMwPvEw9Awt9y4KjMgfVgSnKwXd0HaSZ2zE6aJU3Wry
Fy2Tv0e0OH3z9Bi/LNuJ
=YWSl
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Highlights include:
Stable fixes:
- Fix a 1-byte stack overflow in nfs_idmap_read_and_verify_message
- Fix a hang due to incorrect error returns in rpcrdma_convert_iovs()
- Revert an incorrect change to the NFSv4.1 callback channel
- Fix a bug in the NFSv4.1 sequence error handling
Features and optimisations:
- Support for piggybacking a LAYOUTGET operation to the OPEN compound
- RDMA performance enhancements to deal with transport congestion
- Add proper SPDX tags for NetApp-contributed RDMA source
- Do not request delegated file attributes (size+change) from the
server
- Optimise away a GETATTR in the lookup revalidate code when doing
NFSv4 OPEN
- Optimise away unnecessary lookups for rename targets
- Misc performance improvements when freeing NFSv4 delegations
Bugfixes and cleanups:
- Try to fail quickly if proto=rdma
- Clean up RDMA receive trace points
- Fix sillyrename to return the delegation when appropriate
- Misc attribute revalidation fixes
- Immediately clear the pNFS layout on a file when the server returns
ESTALE
- Return NFS4ERR_DELAY when delegation/layout recalls fail due to
igrab()
- Fix the client behaviour on NFS4ERR_SEQ_FALSE_RETRY"
* tag 'nfs-for-4.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (80 commits)
skip LAYOUTRETURN if layout is invalid
NFSv4.1: Fix the client behaviour on NFS4ERR_SEQ_FALSE_RETRY
NFSv4: Fix a typo in nfs41_sequence_process
NFSv4: Revert commit 5f83d86cf5 ("NFSv4.x: Fix wraparound issues..")
NFSv4: Return NFS4ERR_DELAY when a layout recall fails due to igrab()
NFSv4: Return NFS4ERR_DELAY when a delegation recall fails due to igrab()
NFSv4.0: Remove transport protocol name from non-UCS client ID
NFSv4.0: Remove cl_ipaddr from non-UCS client ID
NFSv4: Fix a compiler warning when CONFIG_NFS_V4_1 is undefined
NFS: Filter cache invalidation when holding a delegation
NFS: Ignore NFS_INO_REVAL_FORCED in nfs_check_inode_attributes()
NFS: Improve caching while holding a delegation
NFS: Fix attribute revalidation
NFS: fix up nfs_setattr_update_inode
NFSv4: Ensure the inode is clean when we set a delegation
NFSv4: Ignore NFS_INO_REVAL_FORCED in nfs4_proc_access
NFSv4: Don't ask for delegated attributes when adding a hard link
NFSv4: Don't ask for delegated attributes when revalidating the inode
NFS: Pass the inode down to the getattr() callback
NFSv4: Don't request size+change attribute if they are delegated to us
...
from Chuck Lever with new trace points, miscellaneous cleanups, and
streamlining of the send and receive paths. Other than that, some
miscellaneous bugfixes.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbHtKUAAoJECebzXlCjuG+dfgP/2Z9PiJXlxKC2iISgkfMGmBd
MmWZYekYMtCe5raoiI720W5cGL7uBLoKnc+r57+n7bEGxV9OFwtspmKGn17P/zrY
YcBIdN7gjpqn8wrflLR4D09bGpnmaZG26jIt/v0TS+N1aFKO3gNXb0ZVSjUadlI0
UsKRbYxr8qucIENVtXhfA0eRivddadsKopAEwflUrxf+8oEaYszPFUfNXcGDpdHK
+6D2lFjr/Fn+z97Rbz/G3fMfldpYhUOpH28DOiCuKEpgamK3dYjx1WoGUANxcj3o
RsbHGZnMR6842Nj5aHus0k6Ao9bgqt6lx+jKlkvWYK+G2EfMfV9Z1gAipPY+IMbd
Zk5A4pnFpI1UG3sUlcnpaxAM/pHBs7heYGqj0hyocG8rB4V7SDZxp21Lv1fjTH/A
XHAkdiT4iSgI11J8YbmDBR1S7bAnfNm7GT24DsAkZLzh2f5Miq5m/ZMxDxQLAFCJ
3YKo2aNVjKvA/aOKDe5RMLZUhnmuhb8aMIDuQY2Ir1EK4S+7EYOiYAvqlbJrM3Ro
aLmb9BUzRRWmRydMKOeGkWiMj49lHRW6oJxvb33PDZEEqW/AlvmYEyMGfjhXzPDE
OZkvbdYrni4n5YboplxNnJyL0NJ6l5YAikV94SBWBknrnNv1psSZbDKoIgp2ghhQ
rdP842qSmDiZiXVlTr3e
=PuEk
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.18' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"A relatively quiet cycle for nfsd.
The largest piece is an RDMA update from Chuck Lever with new trace
points, miscellaneous cleanups, and streamlining of the send and
receive paths.
Other than that, some miscellaneous bugfixes"
* tag 'nfsd-4.18' of git://linux-nfs.org/~bfields/linux: (26 commits)
nfsd: fix error handling in nfs4_set_delegation()
nfsd: fix potential use-after-free in nfsd4_decode_getdeviceinfo
Fix 16-byte memory leak in gssp_accept_sec_context_upcall
svcrdma: Fix incorrect return value/type in svc_rdma_post_recvs
svcrdma: Remove unused svc_rdma_op_ctxt
svcrdma: Persistently allocate and DMA-map Send buffers
svcrdma: Simplify svc_rdma_send()
svcrdma: Remove post_send_wr
svcrdma: Don't overrun the SGE array in svc_rdma_send_ctxt
svcrdma: Introduce svc_rdma_send_ctxt
svcrdma: Clean up Send SGE accounting
svcrdma: Refactor svc_rdma_dma_map_buf
svcrdma: Allocate recv_ctxt's on CPU handling Receives
svcrdma: Persistently allocate and DMA-map Receive buffers
svcrdma: Preserve Receive buffer until svc_rdma_sendto
svcrdma: Simplify svc_rdma_recv_ctxt_put
svcrdma: Remove sc_rq_depth
svcrdma: Introduce svc_rdma_recv_ctxt
svcrdma: Trace key RDMA API events
svcrdma: Trace key RPC/RDMA protocol events
...
Commit 173d3adb6f ("xsk: add zero-copy support for Rx") introduced a
regression on the XDP_SKB receive path, when the queue id checks were
removed. Now, they are back again.
Fixes: 173d3adb6f ("xsk: add zero-copy support for Rx")
Reported-by: Qi Zhang <qi.z.zhang@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This is not necessary. skb_gro_receive() will never change what
'head' points to.
In it's original implementation (see commit 71d93b39e5 ("net: Add
skb_gro_receive")), it did:
====================
+ *head = nskb;
+ nskb->next = p->next;
+ p->next = NULL;
====================
This sequence was removed in commit 58025e46ea ("net: gro: remove
obsolete code from skb_gro_receive()")
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Daniel Borkmann says:
====================
pull-request: bpf 2018-06-12
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Avoid an allocation warning in AF_XDP by adding __GFP_NOWARN for the
umem setup, from Björn.
2) Silence a warning in bpf fs when an application tries to open(2) a
pinned bpf obj due to missing fops. Add a dummy open fop that continues
to just bail out in such case, from Daniel.
3) Fix a BPF selftest urandom_read build issue where gcc complains that
it gets built twice, from Anders.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
While hacking on kTLS, I ran into the following panic from an
unprivileged netserver / netperf TCP session:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
PGD 800000037f378067 P4D 800000037f378067 PUD 3c0e61067 PMD 0
Oops: 0010 [#1] SMP KASAN PTI
CPU: 1 PID: 2289 Comm: netserver Not tainted 4.17.0+ #139
Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET47W (1.21 ) 11/28/2016
RIP: 0010: (null)
Code: Bad RIP value.
RSP: 0018:ffff88036abcf740 EFLAGS: 00010246
RAX: dffffc0000000000 RBX: ffff88036f5f6800 RCX: 1ffff1006debed26
RDX: ffff88036abcf920 RSI: ffff8803cb1a4f00 RDI: ffff8803c258c280
RBP: ffff8803c258c280 R08: ffff8803c258c280 R09: ffffed006f559d48
R10: ffff88037aacea43 R11: ffffed006f559d49 R12: ffff8803c258c280
R13: ffff8803cb1a4f20 R14: 00000000000000db R15: ffffffffc168a350
FS: 00007f7e631f4700(0000) GS:ffff8803d1c80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 00000003ccf64005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
? tls_sw_poll+0xa4/0x160 [tls]
? sock_poll+0x20a/0x680
? do_select+0x77b/0x11a0
? poll_schedule_timeout.constprop.12+0x130/0x130
? pick_link+0xb00/0xb00
? read_word_at_a_time+0x13/0x20
? vfs_poll+0x270/0x270
? deref_stack_reg+0xad/0xe0
? __read_once_size_nocheck.constprop.6+0x10/0x10
[...]
Debugging further, it turns out that calling into ctx->sk_poll() is
invalid since sk_poll itself is NULL which was saved from the original
TCP socket in order for tls_sw_poll() to invoke it.
Looks like the recent conversion from poll to poll_mask callback started
in 1525242310 ("net: add support for ->poll_mask in proto_ops") missed
to eventually convert kTLS, too: TCP's ->poll was converted over to the
->poll_mask in commit 2c7d3daceb ("net/tcp: convert to ->poll_mask")
and therefore kTLS wrongly saved the ->poll old one which is now NULL.
Convert kTLS over to use ->poll_mask instead. Also instead of POLLIN |
POLLRDNORM use the proper EPOLLIN | EPOLLRDNORM bits as the case in
tcp_poll_mask() as well that is mangled here.
Fixes: 2c7d3daceb ("net/tcp: convert to ->poll_mask")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Watson <davejwatson@fb.com>
Tested-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzkaller reported a warning from xdp_umem_pin_pages():
WARNING: CPU: 1 PID: 4537 at mm/slab_common.c:996 kmalloc_slab+0x56/0x70 mm/slab_common.c:996
...
__do_kmalloc mm/slab.c:3713 [inline]
__kmalloc+0x25/0x760 mm/slab.c:3727
kmalloc_array include/linux/slab.h:634 [inline]
kcalloc include/linux/slab.h:645 [inline]
xdp_umem_pin_pages net/xdp/xdp_umem.c:205 [inline]
xdp_umem_reg net/xdp/xdp_umem.c:318 [inline]
xdp_umem_create+0x5c9/0x10f0 net/xdp/xdp_umem.c:349
xsk_setsockopt+0x443/0x550 net/xdp/xsk.c:531
__sys_setsockopt+0x1bd/0x390 net/socket.c:1935
__do_sys_setsockopt net/socket.c:1946 [inline]
__se_sys_setsockopt net/socket.c:1943 [inline]
__x64_sys_setsockopt+0xbe/0x150 net/socket.c:1943
do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x49/0xbe
This is a warning about attempting to allocate more than
KMALLOC_MAX_SIZE memory. The request originates from userspace, and if
the request is too big, the kernel is free to deny its allocation. In
this patch, the failed allocation attempt is silenced with
__GFP_NOWARN.
Fixes: c0c77d8fb7 ("xsk: add user memory registration support sockopt")
Reported-by: syzbot+4abadc5d69117b346506@syzkaller.appspotmail.com
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for your net tree:
1) Reject non-null terminated helper names from xt_CT, from Gao Feng.
2) Fix KASAN splat due to out-of-bound access from commit phase, from
Alexey Kodanev.
3) Missing conntrack hook registration on IPVS FTP helper, from Julian
Anastasov.
4) Incorrect skbuff allocation size in bridge nft_reject, from Taehee Yoo.
5) Fix inverted check on packet xmit to non-local addresses, also from
Julian.
6) Fix ebtables alignment compat problems, from Alin Nastac.
7) Hook mask checks are not correct in xt_set, from Serhey Popovych.
8) Fix timeout listing of element in ipsets, from Jozsef.
9) Cap maximum timeout value in ipset, also from Jozsef.
10) Don't allow family option for hash:mac sets, from Florent Fourcot.
11) Restrict ebtables to work with NFPROTO_BRIDGE targets only, this
Florian.
12) Another bug reported by KASAN in the rbtree set backend, from
Taehee Yoo.
13) Missing __IPS_MAX_BIT update doesn't include IPS_OFFLOAD_BIT.
From Gao Feng.
14) Missing initialization of match/target in ebtables, from Florian
Westphal.
15) Remove useless nft_dup.h file in include path, from C. Labbe.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When pskb_trim_rcsum fails, the lack of error-handling code may
cause unexpected results.
This patch adds error-handling code after calling pskb_trim_rcsum.
Signed-off-by: Zhouyang Jia <jiazhouyang09@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPVS setups with local client and remote tunnel server need
to create exception for the local virtual IP. What we do is to
change PMTU from 64KB (on "lo") to 1460 in the common case.
Suggested-by: Martin KaFai Lau <kafai@fb.com>
Fixes: 45e4fd2668 ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
Fixes: 7343ff31eb ("ipv6: Don't create clones of host routes.")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: David Ahern <dsahern@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix several bpfilter/UMH bugs, in particular make the UMH build not
depend upon X86 specific Kconfig symbols. From Alexei Starovoitov.
2) Fix handling of modified context pointer in bpf verifier, from
Daniel Borkmann.
3) Kill regression in ifdown/ifup sequences for hv_netvsc driver, from
Dexuan Cui.
4) When the bonding primary member name changes, we have to re-evaluate
the bond->force_primary setting, from Xiangning Yu.
5) Eliminate possible padding beyone end of SKB in cdc_ncm driver, from
Bjørn Mork.
6) RX queue length reported for UDP sockets in procfs and socket diag
are inaccurate, from Paolo Abeni.
7) Fix br_fdb_find_port() locking, from Petr Machata.
8) Limit sk_rcvlowat values properly in TCP, from Soheil Hassas
Yeganeh.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (23 commits)
tcp: limit sk_rcvlowat by the maximum receive buffer
net: phy: dp83822: use BMCR_ANENABLE instead of BMSR_ANEGCAPABLE for DP83620
socket: close race condition between sock_close() and sockfs_setattr()
net: bridge: Fix locking in br_fdb_find_port()
udp: fix rx queue len reported by diag and proc interface
cdc_ncm: avoid padding beyond end of skb
net/sched: act_simple: fix parsing of TCA_DEF_DATA
net: fddi: fix a possible null-ptr-deref
net: aquantia: fix unsigned numvecs comparison with less than zero
net: stmmac: fix build failure due to missing COMMON_CLK dependency
bpfilter: fix race in pipe access
bpf, xdp: fix crash in xdp_umem_unaccount_pages
xsk: Fix umem fill/completion queue mmap on 32-bit
tools/bpf: fix selftest get_cgroup_id_user
bpfilter: fix OUTPUT_FORMAT
umh: fix race condition
net: mscc: ocelot: Fix uninitialized error in ocelot_netdevice_event()
bonding: re-evaluate force_primary when the primary slave name changes
ip_tunnel: Fix name string concatenate in __ip_tunnel_create()
hv_netvsc: Fix a network regression after ifdown/ifup
...
The user-provided value to setsockopt(SO_RCVLOWAT) can be
larger than the maximum possible receive buffer. Such values
mute POLLIN signals on the socket which can stall progress
on the socket.
Limit the user-provided value to half of the maximum receive
buffer, i.e., half of sk_rcvbuf when the receive buffer size
is set by the user, or otherwise half of sysctl_tcp_rmem[2].
Fixes: d1361840f8 ("tcp: fix SO_RCVLOWAT and RCVBUF autotuning")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fchownat() doesn't even hold refcnt of fd until it figures out
fd is really needed (otherwise is ignored) and releases it after
it resolves the path. This means sock_close() could race with
sockfs_setattr(), which leads to a NULL pointer dereference
since typically we set sock->sk to NULL in ->release().
As pointed out by Al, this is unique to sockfs. So we can fix this
in socket layer by acquiring inode_lock in sock_close() and
checking against NULL in sockfs_setattr().
sock_release() is called in many places, only the sock_close()
path matters here. And fortunately, this should not affect normal
sock_close() as it is only called when the last fd refcnt is gone.
It only affects sock_close() with a parallel sockfs_setattr() in
progress, which is not common.
Fixes: 86741ec254 ("net: core: Add a UID field to struct sock.")
Reported-by: shankarapailoor <shankarapailoor@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Callers of br_fdb_find() need to hold the hash lock, which
br_fdb_find_port() doesn't do. However, since br_fdb_find_port() is not
doing any actual FDB manipulation, the hash lock is not really needed at
all. So convert to br_fdb_find_rcu(), surrounded by rcu_read_lock() /
_unlock() pair.
The device pointer copied from inside the FDB entry is then kept alive
by the RTNL lock, which br_fdb_find_port() asserts.
Fixes: 4d4fd36126 ("net: bridge: Publish bridge accessor functions")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 6b229cf77d ("udp: add batching to udp_rmem_release()")
the sk_rmem_alloc field does not measure exactly anymore the
receive queue length, because we batch the rmem release. The issue
is really apparent only after commit 0d4a6608f6 ("udp: do rmem bulk
free even if the rx sk queue is empty"): the user space can easily
check for an empty socket with not-0 queue length reported by the 'ss'
tool or the procfs interface.
We need to use a custom UDP helper to report the correct queue length,
taking into account the forward allocation deficit.
Reported-by: trevor.francis@46labs.com
Fixes: 6b229cf77d ("UDP: add batching to udp_rmem_release()")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
use nla_strlcpy() to avoid copying data beyond the length of TCA_DEF_DATA
netlink attribute, in case it is less than SIMP_MAX_DATA and it does not
end with '\0' character.
v2: fix errors in the commit message, thanks Hangbin Liu
Fixes: fa1b1cff3d ("net_cls_act: Make act_simple use of netlink policy.")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a 16-byte memory leak inside sunrpc/auth_gss on an nfs server when
a client mounts with 'sec=krb5' in a simple mount / umount loop. The leak
is seen by either monitoring the kmalloc-16 slab or with kmemleak enabled
unreferenced object 0xffff92e6a045f030 (size 16):
comm "nfsd", pid 1096, jiffies 4294936658 (age 761.110s)
hex dump (first 16 bytes):
2a 86 48 86 f7 12 01 02 02 00 00 00 00 00 00 00 *.H.............
backtrace:
[<000000004b2b79a7>] gssx_dec_buffer+0x79/0x90 [auth_rpcgss]
[<000000002610ac1a>] gssx_dec_accept_sec_context+0x215/0x6dd [auth_rpcgss]
[<000000004fd0e81d>] rpcauth_unwrap_resp+0xa9/0xe0 [sunrpc]
[<000000002b099233>] call_decode+0x1e9/0x840 [sunrpc]
[<00000000954fc846>] __rpc_execute+0x80/0x3f0 [sunrpc]
[<00000000c83a961c>] rpc_run_task+0x10d/0x150 [sunrpc]
[<000000002c2cdcd2>] rpc_call_sync+0x4d/0xa0 [sunrpc]
[<000000000b74eea2>] gssp_accept_sec_context_upcall+0x196/0x470 [auth_rpcgss]
[<000000003271273f>] svcauth_gss_proxy_init+0x188/0x520 [auth_rpcgss]
[<000000001cf69f01>] svcauth_gss_accept+0x3a6/0xb50 [auth_rpcgss]
If you map the above to code you'll see the following call chain
gssx_dec_accept_sec_context
gssx_dec_ctx (missing from kmemleak output)
gssx_dec_buffer(xdr, &ctx->mech)
Inside gssx_dec_buffer there is 'kmemdup' where we allocate memory for
any gssx_buffer (buf) and store into buf->data. In the above instance,
'buf == &ctx->mech).
Further up in the chain in gssp_accept_sec_context_upcall we see ctx->mech
is part of a stack variable 'struct gssx_ctx rctxh'. Now later inside
gssp_accept_sec_context_upcall after gssp_call, there is a number of
memcpy and kfree statements, but there is no kfree(rctxh.mech.data)
after the memcpy into data->mech_oid.data.
With this patch applied and the same mount / unmount loop, the kmalloc-16
slab is stable and kmemleak enabled no longer shows the above backtrace.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This crept in during the development process and wasn't caught
before I posted the "final" version.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 0b2613c5883f ('svcrdma: Allocate recv_ctxt's on CPU ... ')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
syzbot reports following splat:
BUG: KMSAN: uninit-value in ebt_stp_mt_check+0x24b/0x450
net/bridge/netfilter/ebt_stp.c:162
ebt_stp_mt_check+0x24b/0x450 net/bridge/netfilter/ebt_stp.c:162
xt_check_match+0x1438/0x1650 net/netfilter/x_tables.c:506
ebt_check_match net/bridge/netfilter/ebtables.c:372 [inline]
ebt_check_entry net/bridge/netfilter/ebtables.c:702 [inline]
The uninitialised access is
xt_mtchk_param->nft_compat
... which should be set to 0.
Fix it by zeroing the struct beforehand, same for tgchk.
ip(6)tables targetinfo uses c99-style initialiser, so no change
needed there.
Reported-by: syzbot+da4494182233c23a5fcf@syzkaller.appspotmail.com
Fixes: 55917a21d0 ("netfilter: x_tables: add context to know if extension runs from nft_compat")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
rwlock.h should not be included directly. Instead linux/splinlock.h
should be included. One thing it does is to break the RT build.
Link: http://lkml.kernel.org/r/20180504100319.11880-1-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently when detecting invalid options in option parsing, some
options(e.g. msize) just set errno and allow to continuously validate
other options so that it can detect invalid options as much as possible
and give proper error messages together.
This patch applies same rule to option 'trans' and 'version' when
detecting -EINVAL.
Link: http://lkml.kernel.org/r/1525340676-34072-1-git-send-email-cgxu519@gmx.com
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
syzbot reported the following crash
[ 338.293946] bpfilter: read fail -512
[ 338.304515] kasan: GPF could be caused by NULL-ptr deref or user memory access
[ 338.311863] general protection fault: 0000 [#1] SMP KASAN
[ 338.344360] RIP: 0010:__vfs_write+0x4a6/0x960
[ 338.426363] Call Trace:
[ 338.456967] __kernel_write+0x10c/0x380
[ 338.460928] __bpfilter_process_sockopt+0x1d8/0x35b
[ 338.487103] bpfilter_mbox_request+0x4d/0xb0
[ 338.491492] bpfilter_ip_get_sockopt+0x6b/0x90
This can happen when multiple cpus trying to talk to user mode process
via bpfilter_mbox_request(). One cpu grabs the mutex while another goes to
sleep on the same mutex. Then former cpu sees that umh pipe is down and
shuts down the pipes. Later cpu finally acquires the mutex and crashes
on freed pipe.
Fix the race by using info.pid as an indicator that umh and pipes are healthy
and check it after acquiring the mutex.
Fixes: d2ba09c17a ("net: add skeleton of bpfilter kernel module")
Reported-by: syzbot+7ade6c94abb2774c0fee@syzkaller.appspotmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2018-06-08
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix in the BPF verifier to reject modified ctx pointers on helper
functions, from Daniel.
2) Fix in BPF kselftests for get_cgroup_id_user() helper to only
record the cgroup id for a provided pid in order to reduce test
failures from processes interferring with the test, from Yonghong.
3) Fix a crash in AF_XDP's mem accounting when the process owning
the sock has CAP_IPC_LOCK capabilities set, from Daniel.
4) Fix an issue for AF_XDP on 32 bit machines where XDP_UMEM_PGOFF_*_RING
defines need ULL suffixes and use loff_t type as they are otherwise
truncated, from Geert.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
syzkaller was able to trigger the following panic for AF_XDP:
BUG: KASAN: null-ptr-deref in atomic64_sub include/asm-generic/atomic-instrumented.h:144 [inline]
BUG: KASAN: null-ptr-deref in atomic_long_sub include/asm-generic/atomic-long.h:199 [inline]
BUG: KASAN: null-ptr-deref in xdp_umem_unaccount_pages.isra.4+0x3d/0x80 net/xdp/xdp_umem.c:135
Write of size 8 at addr 0000000000000060 by task syz-executor246/4527
CPU: 1 PID: 4527 Comm: syz-executor246 Not tainted 4.17.0+ #89
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b9/0x294 lib/dump_stack.c:113
kasan_report_error mm/kasan/report.c:352 [inline]
kasan_report.cold.7+0x6d/0x2fe mm/kasan/report.c:412
check_memory_region_inline mm/kasan/kasan.c:260 [inline]
check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
kasan_check_write+0x14/0x20 mm/kasan/kasan.c:278
atomic64_sub include/asm-generic/atomic-instrumented.h:144 [inline]
atomic_long_sub include/asm-generic/atomic-long.h:199 [inline]
xdp_umem_unaccount_pages.isra.4+0x3d/0x80 net/xdp/xdp_umem.c:135
xdp_umem_reg net/xdp/xdp_umem.c:334 [inline]
xdp_umem_create+0xd6c/0x10f0 net/xdp/xdp_umem.c:349
xsk_setsockopt+0x443/0x550 net/xdp/xsk.c:531
__sys_setsockopt+0x1bd/0x390 net/socket.c:1935
__do_sys_setsockopt net/socket.c:1946 [inline]
__se_sys_setsockopt net/socket.c:1943 [inline]
__x64_sys_setsockopt+0xbe/0x150 net/socket.c:1943
do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x49/0xbe
In xdp_umem_reg() the call to xdp_umem_account_pages() passed
with CAP_IPC_LOCK where we didn't need to end up charging rlimit
on memlock for the current user and therefore umem->user continues
to be NULL. Later on through fault injection syzkaller triggered
a failure in either umem->pgs or umem->pages allocation such that
we bail out and undo accounting in xdp_umem_unaccount_pages()
where we eventually hit the panic since it tries to deref the
umem->user.
The code is pretty close to mm_account_pinned_pages() and
mm_unaccount_pinned_pages() pair and potentially could reuse
it even in a later cleanup, and it appears that the initial
commit c0c77d8fb7 ("xsk: add user memory registration support
sockopt") got this right while later follow-up introduced the
bug via a49049ea25 ("xsk: simplified umem setup").
Fixes: a49049ea25 ("xsk: simplified umem setup")
Reported-by: syzbot+979217770b09ebf5c407@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
With gcc-4.1.2 on 32-bit:
net/xdp/xsk.c:663: warning: integer constant is too large for ‘long’ type
net/xdp/xsk.c:665: warning: integer constant is too large for ‘long’ type
Add the missing "ULL" suffixes to the large XDP_UMEM_PGOFF_*_RING values
to fix this.
net/xdp/xsk.c:663: warning: comparison is always false due to limited range of data type
net/xdp/xsk.c:665: warning: comparison is always false due to limited range of data type
"unsigned long" is 32-bit on 32-bit systems, hence the offset is
truncated, and can never be equal to any of the XDP_UMEM_PGOFF_*_RING
values. Use loff_t (and the required cast) to fix this.
Fixes: 423f38329d ("xsk: add umem fill queue support and mmap")
Fixes: fe2308328c ("xsk: add umem completion queue support and mmap")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
CONFIG_OUTPUT_FORMAT is x86 only macro.
Used objdump to extract elf file format.
Fixes: d2ba09c17a ("net: add skeleton of bpfilter kernel module")
Reported-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
By passing a limit of 2 bytes to strncat, strncat is limited to writing
fewer bytes than what it's supposed to append to the name here.
Since the bounds are checked on the line above this, just remove the string
bounds checks entirely since they're unneeded.
Signed-off-by: Sultan Alsawaf <sultanxda@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tun, tap, virtio, packet and uml vector all use struct virtio_net_hdr
to communicate packet metadata to userspace.
For skbuffs with vlan, the first two return the packet as it may have
existed on the wire, inserting the VLAN tag in the user buffer. Then
virtio_net_hdr.csum_start needs to be adjusted by VLAN_HLEN bytes.
Commit f09e2249c4 ("macvtap: restore vlan header on user read")
added this feature to macvtap. Commit 3ce9b20f19 ("macvtap: Fix
csum_start when VLAN tags are present") then fixed up csum_start.
Virtio, packet and uml do not insert the vlan header in the user
buffer.
When introducing virtio_net_hdr_from_skb to deduplicate filling in
the virtio_net_hdr, the variant from macvtap which adds VLAN_HLEN was
applied uniformly, breaking csum offset for packets with vlan on
virtio and packet.
Make insertion of VLAN_HLEN optional. Convert the callers to pass it
when needed.
Fixes: e858fae2b0 ("virtio_net: use common code for virtio_net_hdr and skb GSO conversion")
Fixes: 1276f24eee ("packet: use common code for virtio_net_hdr and skb GSO conversion")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is internal, not exposed through uapi, and although it maps with
userspace LOG_*, with the introduction of LOGLEVEL_AUDIT we are
incurring in namespace pollution.
This patch adds the NFT_LOGLEVEL_ enumeration and use it from nft_log.
Fixes: 1a893b44de ("netfilter: nf_tables: Add audit support to log statement")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
1) Add Maglev hashing scheduler to IPVS, from Inju Song.
2) Lots of new TC subsystem tests from Roman Mashak.
3) Add TCP zero copy receive and fix delayed acks and autotuning with
SO_RCVLOWAT, from Eric Dumazet.
4) Add XDP_REDIRECT support to mlx5 driver, from Jesper Dangaard
Brouer.
5) Add ttl inherit support to vxlan, from Hangbin Liu.
6) Properly separate ipv6 routes into their logically independant
components. fib6_info for the routing table, and fib6_nh for sets of
nexthops, which thus can be shared. From David Ahern.
7) Add bpf_xdp_adjust_tail helper, which can be used to generate ICMP
messages from XDP programs. From Nikita V. Shirokov.
8) Lots of long overdue cleanups to the r8169 driver, from Heiner
Kallweit.
9) Add BTF ("BPF Type Format"), from Martin KaFai Lau.
10) Add traffic condition monitoring to iwlwifi, from Luca Coelho.
11) Plumb extack down into fib_rules, from Roopa Prabhu.
12) Add Flower classifier offload support to igb, from Vinicius Costa
Gomes.
13) Add UDP GSO support, from Willem de Bruijn.
14) Add documentation for eBPF helpers, from Quentin Monnet.
15) Add TLS tx offload to mlx5, from Ilya Lesokhin.
16) Allow applications to be given the number of bytes available to read
on a socket via a control message returned from recvmsg(), from
Soheil Hassas Yeganeh.
17) Add x86_32 eBPF JIT compiler, from Wang YanQing.
18) Add AF_XDP sockets, with zerocopy support infrastructure as well.
From Björn Töpel.
19) Remove indirect load support from all of the BPF JITs and handle
these operations in the verifier by translating them into native BPF
instead. From Daniel Borkmann.
20) Add GRO support to ipv6 gre tunnels, from Eran Ben Elisha.
21) Allow XDP programs to do lookups in the main kernel routing tables
for forwarding. From David Ahern.
22) Allow drivers to store hardware state into an ELF section of kernel
dump vmcore files, and use it in cxgb4. From Rahul Lakkireddy.
23) Various RACK and loss detection improvements in TCP, from Yuchung
Cheng.
24) Add TCP SACK compression, from Eric Dumazet.
25) Add User Mode Helper support and basic bpfilter infrastructure, from
Alexei Starovoitov.
26) Support ports and protocol values in RTM_GETROUTE, from Roopa
Prabhu.
27) Support bulking in ->ndo_xdp_xmit() API, from Jesper Dangaard
Brouer.
28) Add lots of forwarding selftests, from Petr Machata.
29) Add generic network device failover driver, from Sridhar Samudrala.
* ra.kernel.org:/pub/scm/linux/kernel/git/davem/net-next: (1959 commits)
strparser: Add __strp_unpause and use it in ktls.
rxrpc: Fix terminal retransmission connection ID to include the channel
net: hns3: Optimize PF CMDQ interrupt switching process
net: hns3: Fix for VF mailbox receiving unknown message
net: hns3: Fix for VF mailbox cannot receiving PF response
bnx2x: use the right constant
Revert "net: sched: cls: Fix offloading when ingress dev is vxlan"
net: dsa: b53: Fix for brcm tag issue in Cygnus SoC
enic: fix UDP rss bits
netdev-FAQ: clarify DaveM's position for stable backports
rtnetlink: validate attributes in do_setlink()
mlxsw: Add extack messages for port_{un, }split failures
netdevsim: Add extack error message for devlink reload
devlink: Add extack to reload and port_{un, }split operations
net: metrics: add proper netlink validation
ipmr: fix error path when ipmr_new_table fails
ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds
net: hns3: remove unused hclgevf_cfg_func_mta_filter
netfilter: provide udp*_lib_lookup for nf_tproxy
qed*: Utilize FW 8.37.2.0
...
- Use overflow helpers in 2-factor allocators (Kees, Rasmus)
- Introduce overflow test module (Rasmus, Kees)
- Introduce saturating size helper functions (Matthew, Kees)
- Treewide use of struct_size() for allocators (Kees)
-----BEGIN PGP SIGNATURE-----
Comment: Kees Cook <kees@outflux.net>
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAlsYJ1gWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJlCTEACwdEeriAd2VwxknnsstojGD/3g
8TTFA19vSu4Gxa6WiDkjGoSmIlfhXTlZo1Nlmencv16ytSvIVDNLUIB3uDxUIv1J
2+dyHML9JpXYHHR7zLXXnGFJL0wazqjbsD3NYQgXqmun7EVVYnOsAlBZ7h/Lwiej
jzEJd8DaHT3TA586uD3uggiFvQU0yVyvkDCDONIytmQx+BdtGdg9TYCzkBJaXuDZ
YIthyKDvxIw5nh/UaG3L+SKo73tUr371uAWgAfqoaGQQCWe+mxnWL4HkCKsjFzZL
u9ouxxF/n6pij3E8n6rb0i2fCzlsTDdDF+aqV1rQ4I4hVXCFPpHUZgjDPvBWbj7A
m6AfRHVNnOgI8HGKqBGOfViV+2kCHlYeQh3pPW33dWzy/4d/uq9NIHKxE63LH+S4
bY3oO2ela8oxRyvEgXLjqmRYGW1LB/ZU7FS6Rkx2gRzo4k8Rv+8K/KzUHfFVRX61
jEbiPLzko0xL9D53kcEn0c+BhofK5jgeSWxItdmfuKjLTW4jWhLRlU+bcUXb6kSS
S3G6aF+L+foSUwoq63AS8QxCuabuhreJSB+BmcGUyjthCbK/0WjXYC6W/IJiRfBa
3ZTxBC/2vP3uq/AGRNh5YZoxHL8mSxDfn62F+2cqlJTTKR/O+KyDb1cusyvk3H04
KCDVLYPxwQQqK1Mqig==
=/3L8
-----END PGP SIGNATURE-----
Merge tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull overflow updates from Kees Cook:
"This adds the new overflow checking helpers and adds them to the
2-factor argument allocators. And this adds the saturating size
helpers and does a treewide replacement for the struct_size() usage.
Additionally this adds the overflow testing modules to make sure
everything works.
I'm still working on the treewide replacements for allocators with
"simple" multiplied arguments:
*alloc(a * b, ...) -> *alloc_array(a, b, ...)
and
*zalloc(a * b, ...) -> *calloc(a, b, ...)
as well as the more complex cases, but that's separable from this
portion of the series. I expect to have the rest sent before -rc1
closes; there are a lot of messy cases to clean up.
Summary:
- Introduce arithmetic overflow test helper functions (Rasmus)
- Use overflow helpers in 2-factor allocators (Kees, Rasmus)
- Introduce overflow test module (Rasmus, Kees)
- Introduce saturating size helper functions (Matthew, Kees)
- Treewide use of struct_size() for allocators (Kees)"
* tag 'overflow-v4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
treewide: Use struct_size() for devm_kmalloc() and friends
treewide: Use struct_size() for vmalloc()-family
treewide: Use struct_size() for kmalloc()-family
device: Use overflow helpers for devm_kmalloc()
mm: Use overflow helpers in kvmalloc()
mm: Use overflow helpers in kmalloc_array*()
test_overflow: Add memory allocation overflow tests
overflow.h: Add allocation size calculation helpers
test_overflow: Report test failures
test_overflow: macrofy some more, do more tests for free
lib: add runtime test of check_*_overflow functions
compiler.h: enable builtin overflow checkers and add fallback code
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEcQCq365ubpQNLgrWVeRaWujKfIoFAlsXFUEUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQVeRaWujKfIoomg//eRNpc6x9kxTijN670AC2uD0CBTlZ
2z6mHuJaOhG8bTxjZxQfUBoo6/eZJ2YC1yq6ornGFNzw4sfKsR/j86ujJim2HAmo
opUhziq3SILGEvjsxfPkREe/wb49jy0AA/WjZqciitB1ig8Hz7xzqi0lpNaEspFh
QJFB6XXkojWGFGrRzruAVJnPS+pDWoTQR0qafs3JWKnpeinpOdZnl1hPsysAEHt5
Ag8o4qS/P9xJM0khi7T+jWECmTyT/mtWqEtFcZ0o+JLOgt/EMvNX6DO4ETDiYRD2
mVChga9x5r78bRgNy2U8IlEWWa76WpcQAEODvhzbijX4RxMAmjsmLE+e+udZSnMZ
eCITl2f7ExxrL5SwNFC/5h7pAv0RJ+SOC19vcyeV4JDlQNNVjUy/aNKv5baV0aeg
EmkeobneMWxqHx52aERz8RF1in5pT8gLOYoYnWfNpcDEmjLrwhuZLX2asIzUEqrS
SoPJ8hxIDCxceHOWIIrz5Dqef7x28Dyi46w3QINC8bSy2RnR/H3q40DRegvXOGiS
9WcbbwbhnM4Kau413qKicGCvdqTVYdeyZqo7fVelSciD139Vk7pZotyom4MuU25p
fIyGfXa8/8gkl7fZ+HNkZbba0XWNfAZt//zT095qsp3CkhVnoybwe6OwG1xRqErq
W7OOQbS7vvN/KGo=
=10u6
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20180605' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
"Another reasonable chunk of audit changes for v4.18, thirteen patches
in total.
The thirteen patches can mostly be broken down into one of four
categories: general bug fixes, accessor functions for audit state
stored in the task_struct, negative filter matches on executable
names, and extending the (relatively) new seccomp logging knobs to the
audit subsystem.
The main driver for the accessor functions from Richard are the
changes we're working on to associate audit events with containers,
but I think they have some standalone value too so I figured it would
be good to get them in now.
The seccomp/audit patches from Tyler apply the seccomp logging
improvements from a few releases ago to audit's seccomp logging;
starting with this patchset the changes in
/proc/sys/kernel/seccomp/actions_logged should apply to both the
standard kernel logging and audit.
As usual, everything passes the audit-testsuite and it happens to
merge cleanly with your tree"
[ Heh, except it had trivial merge conflicts with the SELinux tree that
also came in from Paul - Linus ]
* tag 'audit-pr-20180605' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: Fix wrong task in comparison of session ID
audit: use existing session info function
audit: normalize loginuid read access
audit: use new audit_context access funciton for seccomp_actions_logged
audit: use inline function to set audit context
audit: use inline function to get audit context
audit: convert sessionid unset to a macro
seccomp: Don't special case audited processes when logging
seccomp: Audit attempts to modify the actions_logged sysctl
seccomp: Configurable separator for the actions_logged string
seccomp: Separate read and write code for actions_logged sysctl
audit: allow not equal op for audit by executable
audit: add syscall information to FEATURE_CHANGE records
Pull security system updates from James Morris:
- incorporate new socketpair() hook into LSM and wire up the SELinux
and Smack modules. From David Herrmann:
"The idea is to allow SO_PEERSEC to be called on AF_UNIX sockets
created via socketpair(2), and return the same information as if
you emulated socketpair(2) via a temporary listener socket.
Right now SO_PEERSEC will return the unlabeled credentials for a
socketpair, rather than the actual credentials of the creating
process."
- remove the unused security_settime LSM hook (Sargun Dhillon).
- remove some stack allocated arrays from the keys code (Tycho
Andersen)
* 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
dh key: get rid of stack allocated array for zeroes
dh key: get rid of stack allocated array
big key: get rid of stack array allocation
smack: provide socketpair callback
selinux: provide socketpair callback
net: hook socketpair() into LSM
security: add hook for socketpair()
security: remove security_settime
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct foo {
int stuff;
void *entry[];
};
instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:
instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
This patch makes the changes for kmalloc()-family (and kvmalloc()-family)
uses. It was done via automatic conversion with manual review for the
"CHECKME" non-standard cases noted below, using the following Coccinelle
script:
// pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len *
// sizeof *pkey_cache->table, GFP_KERNEL);
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
identifier VAR, ELEMENT;
expression COUNT;
@@
- alloc(sizeof(*VAR) + COUNT * sizeof(*VAR->ELEMENT), GFP)
+ alloc(struct_size(VAR, ELEMENT, COUNT), GFP)
// mr = kzalloc(sizeof(*mr) + m * sizeof(mr->map[0]), GFP_KERNEL);
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
identifier VAR, ELEMENT;
expression COUNT;
@@
- alloc(sizeof(*VAR) + COUNT * sizeof(VAR->ELEMENT[0]), GFP)
+ alloc(struct_size(VAR, ELEMENT, COUNT), GFP)
// Same pattern, but can't trivially locate the trailing element name,
// or variable name.
@@
identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
expression GFP;
expression SOMETHING, COUNT, ELEMENT;
@@
- alloc(sizeof(SOMETHING) + COUNT * sizeof(ELEMENT), GFP)
+ alloc(CHECKME_struct_size(&SOMETHING, ELEMENT, COUNT), GFP)
Signed-off-by: Kees Cook <keescook@chromium.org>