Default policy is defined as a unsigned 8-bit field, do not use a
negative value to leave it unset, use this new NFT_CHAIN_POLICY_UNSET
instead.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Disabling multicast filtering from NCSI if it is supported. As it
should not filter any multicast packets. In current code, multicast
filter is enabled and with an exception of optional field supported
by device are disabled filtering.
Mainly I see if goal is to disable filtering for IPV6 packets then let
it disabled for every other types as well. As we are seeing issues with
LLDP not working with this enabled filtering. And there are other issues
with IPV6.
By Disabling this multicast completely, it is working for both IPV6 as
well as LLDP.
Signed-off-by: Vijay Khemka <vijaykhemka@fb.com>
Acked-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
This patch removes the 64B alignment of the UMEM headroom. There is
really no reason for it, and having a headroom less than 64B should be
valid.
Fixes: c0c77d8fb7 ("xsk: add user memory registration support sockopt")
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Pull vfs mount API infrastructure updates from Al Viro:
"Infrastructure bits of mount API conversions.
The rest is more of per-filesystem updates and that will happen
in separate pull requests"
* 'work.mount-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
mtd: Provide fs_context-aware mount_mtd() replacement
vfs: Create fs_context-aware mount_bdev() replacement
new helper: get_tree_keyed()
vfs: set fs_context::user_ns for reconfigure
Pull networking updates from David Miller:
1) Support IPV6 RA Captive Portal Identifier, from Maciej Żenczykowski.
2) Use bio_vec in the networking instead of custom skb_frag_t, from
Matthew Wilcox.
3) Make use of xmit_more in r8169 driver, from Heiner Kallweit.
4) Add devmap_hash to xdp, from Toke Høiland-Jørgensen.
5) Support all variants of 5750X bnxt_en chips, from Michael Chan.
6) More RTNL avoidance work in the core and mlx5 driver, from Vlad
Buslov.
7) Add TCP syn cookies bpf helper, from Petar Penkov.
8) Add 'nettest' to selftests and use it, from David Ahern.
9) Add extack support to drop_monitor, add packet alert mode and
support for HW drops, from Ido Schimmel.
10) Add VLAN offload to stmmac, from Jose Abreu.
11) Lots of devm_platform_ioremap_resource() conversions, from
YueHaibing.
12) Add IONIC driver, from Shannon Nelson.
13) Several kTLS cleanups, from Jakub Kicinski.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1930 commits)
mlxsw: spectrum_buffers: Add the ability to query the CPU port's shared buffer
mlxsw: spectrum: Register CPU port with devlink
mlxsw: spectrum_buffers: Prevent changing CPU port's configuration
net: ena: fix incorrect update of intr_delay_resolution
net: ena: fix retrieval of nonadaptive interrupt moderation intervals
net: ena: fix update of interrupt moderation register
net: ena: remove all old adaptive rx interrupt moderation code from ena_com
net: ena: remove ena_restore_ethtool_params() and relevant fields
net: ena: remove old adaptive interrupt moderation code from ena_netdev
net: ena: remove code duplication in ena_com_update_nonadaptive_moderation_interval _*()
net: ena: enable the interrupt_moderation in driver_supported_features
net: ena: reimplement set/get_coalesce()
net: ena: switch to dim algorithm for rx adaptive interrupt moderation
net: ena: add intr_moder_rx_interval to struct ena_com_dev and use it
net: phy: adin: implement Energy Detect Powerdown mode via phy-tunable
ethtool: implement Energy Detect Powerdown support via phy-tunable
xen-netfront: do not assume sk_buff_head list is empty in error handling
s390/ctcm: Delete unnecessary checks before the macro call “dev_kfree_skb”
net: ena: don't wake up tx queue when down
drop_monitor: Better sanitize notified packets
...
Pull crypto updates from Herbert Xu:
"API:
- Add the ability to abort a skcipher walk.
Algorithms:
- Fix XTS to actually do the stealing.
- Add library helpers for AES and DES for single-block users.
- Add library helpers for SHA256.
- Add new DES key verification helper.
- Add surrounding bits for ESSIV generator.
- Add accelerations for aegis128.
- Add test vectors for lzo-rle.
Drivers:
- Add i.MX8MQ support to caam.
- Add gcm/ccm/cfb/ofb aes support in inside-secure.
- Add ofb/cfb aes support in media-tek.
- Add HiSilicon ZIP accelerator support.
Others:
- Fix potential race condition in padata.
- Use unbound workqueues in padata"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (311 commits)
crypto: caam - Cast to long first before pointer conversion
crypto: ccree - enable CTS support in AES-XTS
crypto: inside-secure - Probe transform record cache RAM sizes
crypto: inside-secure - Base RD fetchcount on actual RD FIFO size
crypto: inside-secure - Base CD fetchcount on actual CD FIFO size
crypto: inside-secure - Enable extended algorithms on newer HW
crypto: inside-secure: Corrected configuration of EIP96_TOKEN_CTRL
crypto: inside-secure - Add EIP97/EIP197 and endianness detection
padata: remove cpu_index from the parallel_queue
padata: unbind parallel jobs from specific CPUs
padata: use separate workqueues for parallel and serial work
padata, pcrypt: take CPU hotplug lock internally in padata_alloc_possible
crypto: pcrypt - remove padata cpumask notifier
padata: make padata_do_parallel find alternate callback CPU
workqueue: require CPU hotplug read exclusion for apply_workqueue_attrs
workqueue: unconfine alloc/apply/free_workqueue_attrs()
padata: allocate workqueue internally
arm64: dts: imx8mq: Add CAAM node
random: Use wait_event_freezable() in add_hwgenerator_randomness()
crypto: ux500 - Fix COMPILE_TEST warnings
...
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQUwxxKyE5l/npt8ARiEGxRG/Sl2wUCXYAIeQAKCRBiEGxRG/Sl
2/SzAQDEnoNxzV/R5kWFd+2kmFeY3cll0d99KMrWJ8om+kje6QD/cXxZHzFm+T1L
UPF66k76oOODV7cyndjXnTnRXbeCRAM=
=Szby
-----END PGP SIGNATURE-----
Merge tag 'leds-for-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds
Pull LED updates from Jacek Anaszewski:
"In this cycle we've finally managed to contribute the patch set
sorting out LED naming issues. Besides that there are many changes
scattered among various LED class drivers and triggers.
LED naming related improvements:
- add new 'function' and 'color' fwnode properties and deprecate
'label' property which has been frequently abused for conveying
vendor specific names that have been available in sysfs anyway
- introduce a set of standard LED_FUNCTION* definitions
- introduce a set of standard LED_COLOR_ID* definitions
- add a new {devm_}led_classdev_register_ext() API with the
capability of automatic LED name composition basing on the
properties available in the passed fwnode; the function is
backwards compatible in a sense that it uses 'label' data, if
present in the fwnode, for creating LED name
- add tools/leds/get_led_device_info.sh script for retrieving LED
vendor, product and bus names, if applicable; it also performs
basic validation of an LED name
- update following drivers and their DT bindings to use the new LED
registration API:
- leds-an30259a, leds-gpio, leds-as3645a, leds-aat1290, leds-cr0014114,
leds-lm3601x, leds-lm3692x, leds-lp8860, leds-lt3593, leds-sc27xx-blt
Other LED class improvements:
- replace {devm_}led_classdev_register() macros with inlines
- allow to call led_classdev_unregister() unconditionally
- switch to use fwnode instead of be stuck with OF one
LED triggers improvements:
- led-triggers:
- fix dereferencing of null pointer
- fix a memory leak bug
- ledtrig-gpio:
- GPIO 0 is valid
Drop superseeded apu2/3 support from leds-apu since for apu2+ a newer,
more complete driver exists, based on a generic driver for the AMD
SOCs gpio-controller, supporting LEDs as well other devices:
- drop profile field from priv data
- drop iosize field from priv data
- drop enum_apu_led_platform_types
- drop superseeded apu2/3 led support
- add pr_fmt prefix for better log output
- fix error message on probing failure
Other misc fixes and improvements to existing LED class drivers:
- leds-ns2, leds-max77650:
- add of_node_put() before return
- leds-pwm, leds-is31fl32xx:
- use struct_size() helper
- leds-lm3697, leds-lm36274, leds-lm3532:
- switch to use fwnode_property_count_uXX()
- leds-lm3532:
- fix brightness control for i2c mode
- change the define for the fs current register
- fixes for the driver for stability
- add full scale current configuration
- dt: Add property for full scale current.
- avoid potentially unpaired regulator calls
- move static keyword to the front of declarations
- fix optional led-max-microamp prop error handling
- leds-max77650:
- add of_node_put() before return
- add MODULE_ALIAS()
- Switch to fwnode property API
- leds-as3645a:
- fix misuse of strlcpy
- leds-netxbig:
- add of_node_put() in netxbig_leds_get_of_pdata()
- remove legacy board-file support
- leds-is31fl319x:
- simplify getting the adapter of a client
- leds-ti-lmu-common:
- fix coccinelle issue
- move static keyword to the front of declaration
- leds-syscon:
- use resource managed variant of device register
- leds-ktd2692:
- fix a typo in the name of a constant
- leds-lp5562:
- allow firmware files up to the maximum length
- leds-an30259a:
- fix typo
- leds-pca953x:
- include the right header"
* tag 'leds-for-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds: (72 commits)
leds: lm3532: Fix optional led-max-microamp prop error handling
led: triggers: Fix dereferencing of null pointer
leds: ti-lmu-common: Move static keyword to the front of declaration
leds: lm3532: Move static keyword to the front of declarations
leds: trigger: gpio: GPIO 0 is valid
leds: pwm: Use struct_size() helper
leds: is31fl32xx: Use struct_size() helper
leds: ti-lmu-common: Fix coccinelle issue in TI LMU
leds: lm3532: Avoid potentially unpaired regulator calls
leds: syscon: Use resource managed variant of device register
leds: Replace {devm_}led_classdev_register() macros with inlines
leds: Allow to call led_classdev_unregister() unconditionally
leds: lm3532: Add full scale current configuration
dt: lm3532: Add property for full scale current.
leds: lm3532: Fixes for the driver for stability
leds: lm3532: Change the define for the fs current register
leds: lm3532: Fix brightness control for i2c mode
leds: Switch to use fwnode instead of be stuck with OF one
leds: max77650: Switch to fwnode property API
led: triggers: Fix a memory leak bug
...
Pull core timer updates from Thomas Gleixner:
"Timers and timekeeping updates:
- A large overhaul of the posix CPU timer code which is a preparation
for moving the CPU timer expiry out into task work so it can be
properly accounted on the task/process.
An update to the bogus permission checks will come later during the
merge window as feedback was not complete before heading of for
travel.
- Switch the timerqueue code to use cached rbtrees and get rid of the
homebrewn caching of the leftmost node.
- Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
single function
- Implement the separation of hrtimers to be forced to expire in hard
interrupt context even when PREEMPT_RT is enabled and mark the
affected timers accordingly.
- Implement a mechanism for hrtimers and the timer wheel to protect
RT against priority inversion and live lock issues when a (hr)timer
which should be canceled is currently executing the callback.
Instead of infinitely spinning, the task which tries to cancel the
timer blocks on a per cpu base expiry lock which is held and
released by the (hr)timer expiry code.
- Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
resulting in faster access to timekeeping functions.
- Updates to various clocksource/clockevent drivers and their device
tree bindings.
- The usual small improvements all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
posix-cpu-timers: Fix permission check regression
posix-cpu-timers: Always clear head pointer on dequeue
hrtimer: Add a missing bracket and hide `migration_base' on !SMP
posix-cpu-timers: Make expiry_active check actually work correctly
posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
tick: Mark sched_timer to expire in hard interrupt context
hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
posix-cpu-timers: Utilize timerqueue for storage
posix-cpu-timers: Move state tracking to struct posix_cputimers
posix-cpu-timers: Deduplicate rlimit handling
posix-cpu-timers: Remove pointless comparisons
posix-cpu-timers: Get rid of 64bit divisions
posix-cpu-timers: Consolidate timer expiry further
posix-cpu-timers: Get rid of zero checks
rlimit: Rewrite non-sensical RLIMIT_CPU comment
posix-cpu-timers: Respect INFINITY for hard RTTIME limit
posix-cpu-timers: Switch thread group sampling to array
posix-cpu-timers: Restructure expiry array
posix-cpu-timers: Remove cputime_expires
...
Ensure that we set task->tk_rpc_status for all RPC level errors so that
the caller can distinguish between those and server reply status errors.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If we've removed the request from the receive list, and have added
it back after resetting the request receive buffer, then we should
only receive message data if it is a new reply (i.e. if
transport->recv.copied is zero).
Fixes: 277e4ab7d5 ("SUNRPC: Simplify TCP receive code by switching...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Ensure that we dequeue the request from the transport receive queue
while we're re-encoding to prevent issues like use-after-free when
we release the bvec.
Fixes: 7536908982 ("SUNRPC: Ensure the bvecs are reset when we re-encode...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.20+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Pull RCU updates from Ingo Molnar:
"This cycle's RCU changes were:
- A few more RCU flavor consolidation cleanups.
- Updates to RCU's list-traversal macros improving lockdep usability.
- Forward-progress improvements for no-CBs CPUs: Avoid ignoring
incoming callbacks during grace-period waits.
- Forward-progress improvements for no-CBs CPUs: Use ->cblist
structure to take advantage of others' grace periods.
- Also added a small commit that avoids needlessly inflicting
scheduler-clock ticks on callback-offloaded CPUs.
- Forward-progress improvements for no-CBs CPUs: Reduce contention on
->nocb_lock guarding ->cblist.
- Forward-progress improvements for no-CBs CPUs: Add ->nocb_bypass
list to further reduce contention on ->nocb_lock guarding ->cblist.
- Miscellaneous fixes.
- Torture-test updates.
- minor LKMM updates"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (86 commits)
MAINTAINERS: Update from paulmck@linux.ibm.com to paulmck@kernel.org
rcu: Don't include <linux/ktime.h> in rcutiny.h
rcu: Allow rcu_do_batch() to dynamically adjust batch sizes
rcu/nocb: Don't wake no-CBs GP kthread if timer posted under overload
rcu/nocb: Reduce __call_rcu_nocb_wake() leaf rcu_node ->lock contention
rcu/nocb: Reduce nocb_cb_wait() leaf rcu_node ->lock contention
rcu/nocb: Advance CBs after merge in rcutree_migrate_callbacks()
rcu/nocb: Avoid synchronous wakeup in __call_rcu_nocb_wake()
rcu/nocb: Print no-CBs diagnostics when rcutorture writer unduly delayed
rcu/nocb: EXP Check use and usefulness of ->nocb_lock_contended
rcu/nocb: Add bypass callback queueing
rcu/nocb: Atomic ->len field in rcu_segcblist structure
rcu/nocb: Unconditionally advance and wake for excessive CBs
rcu/nocb: Reduce ->nocb_lock contention with separate ->nocb_gp_lock
rcu/nocb: Reduce contention at no-CBs invocation-done time
rcu/nocb: Reduce contention at no-CBs registry-time CB advancement
rcu/nocb: Round down for number of no-CBs grace-period kthreads
rcu/nocb: Avoid ->nocb_lock capture by corresponding CPU
rcu/nocb: Avoid needless wakeups of no-CBs grace-period kthread
rcu/nocb: Make __call_rcu_nocb_wake() safe for many callbacks
...
The `phy_tunable_id` has been named `ETHTOOL_PHY_EDPD` since it looks like
this feature is common across other PHYs (like EEE), and defining
`ETHTOOL_PHY_ENERGY_DETECT_POWER_DOWN` seems too long.
The way EDPD works, is that the RX block is put to a lower power mode,
except for link-pulse detection circuits. The TX block is also put to low
power mode, but the PHY wakes-up periodically to send link pulses, to avoid
lock-ups in case the other side is also in EDPD mode.
Currently, there are 2 PHY drivers that look like they could use this new
PHY tunable feature: the `adin` && `micrel` PHYs.
The ADIN's datasheet mentions that TX pulses are at intervals of 1 second
default each, and they can be disabled. For the Micrel KSZ9031 PHY, the
datasheet does not mention whether they can be disabled, but mentions that
they can modified.
The way this change is structured, is similar to the PHY tunable downshift
control:
* a `ETHTOOL_PHY_EDPD_DFLT_TX_MSECS` value is exposed to cover a default
TX interval; some PHYs could specify a certain value that makes sense
* `ETHTOOL_PHY_EDPD_NO_TX` would disable TX when EDPD is enabled
* `ETHTOOL_PHY_EDPD_DISABLE` will disable EDPD
As noted by the `ETHTOOL_PHY_EDPD_DFLT_TX_MSECS` the interval unit is 1
millisecond, which should cover a reasonable range of intervals:
- from 1 millisecond, which does not sound like much of a power-saver
- to ~65 seconds which is quite a lot to wait for a link to come up when
plugging a cable
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When working in 'packet' mode, drop monitor generates a notification
with a potentially truncated payload of the dropped packet. The payload
is copied from the MAC header, but I forgot to check that the MAC header
was set, so do it now.
Fixes: ca30707dee ("drop_monitor: Add packet alert mode")
Fixes: 5e58109b1e ("drop_monitor: Add support for packet alert mode for hardware drops")
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a preparation patch for the tc-taprio offload (and potentially
for other future offloads such as tc-mqprio).
Instead of looking directly at skb->priority during xmit, let's get the
netdev queue and the queue-to-traffic-class mapping, and put the
resulting traffic class into the dsa_8021q PCP field. The switch is
configured with a 1-to-1 PCP-to-ingress-queue-to-egress-queue mapping
(see vlan_pmap in sja1105_main.c), so the effect is that we can inject
into a front-panel's egress traffic class through VLAN tagging from
Linux, completely transparently.
Unfortunately the switch doesn't look at the VLAN PCP in the case of
management traffic to/from the CPU (link-local frames at
01-80-C2-xx-xx-xx or 01-1B-19-xx-xx-xx) so we can't alter the
transmission queue of this type of traffic on a frame-by-frame basis. It
is only selected through the "hostprio" setting which ATM is harcoded in
the driver to 7.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
DSA currently handles shared block filters (for the classifier-action
qdisc) in the core due to what I believe are simply pragmatic reasons -
hiding the complexity from drivers and offerring a simple API for port
mirroring.
Extend the dsa_slave_setup_tc function by passing all other qdisc
offloads to the driver layer, where the driver may choose what it
implements and how. DSA is simply a pass-through in this case.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Acked-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows taprio to offload the schedule enforcement to capable
network cards, resulting in more precise windows and less CPU usage.
The gate mask acts on traffic classes (groups of queues of same
priority), as specified in IEEE 802.1Q-2018, and following the existing
taprio and mqprio semantics.
It is up to the driver to perform conversion between tc and individual
netdev queues if for some reason it needs to make that distinction.
Full offload is requested from the network interface by specifying
"flags 2" in the tc qdisc creation command, which in turn corresponds to
the TCA_TAPRIO_ATTR_FLAG_FULL_OFFLOAD bit.
The important detail here is the clockid which is implicitly /dev/ptpN
for full offload, and hence not configurable.
A reference counting API is added to support the use case where Ethernet
drivers need to keep the taprio offload structure locally (i.e. they are
a multi-port switch driver, and configuring a port depends on the
settings of other ports as well). The refcount_t variable is kept in a
private structure (__tc_taprio_qopt_offload) and not exposed to drivers.
In the future, the private structure might also be expanded with a
backpointer to taprio_sched *q, to implement the notification system
described in the patch (of when admin became oper, or an error occurred,
etc, so the offload can be monitored with 'tc qdisc show').
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Voon Weifeng <weifeng.voon@intel.com>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neal Cardwell mentioned that snd_wnd would be useful for diagnosing TCP
performance problems --
> (1) Usually when we're diagnosing TCP performance problems, we do so
> from the sender, since the sender makes most of the
> performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc).
> From the sender-side the thing that would be most useful is to see
> tp->snd_wnd, the receive window that the receiver has advertised to
> the sender.
This serves the purpose of adding an additional __u32 to avoid the
would-be hole caused by the addition of the tcpi_rcvi_ooopack field.
Signed-off-by: Thomas Higdon <tph@fb.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For receive-heavy cases on the server-side, we want to track the
connection quality for individual client IPs. This counter, similar to
the existing system-wide TCPOFOQueue counter in /proc/net/netstat,
tracks out-of-order packet reception. By providing this counter in
TCP_INFO, it will allow understanding to what degree receive-heavy
sockets are experiencing out-of-order delivery and packet drops
indicating congestion.
Please note that this is similar to the counter in NetBSD TCP_INFO, and
has the same name.
Also note that we avoid increasing the size of the tcp_sock struct by
taking advantage of a hole.
Signed-off-by: Thomas Higdon <tph@fb.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2019-09-16
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Now that initial BPF backend for gcc has been merged upstream, enable
BPF kselftest suite for bpf-gcc. Also fix a BE issue with access to
bpf_sysctl.file_pos, from Ilya.
2) Follow-up fix for link-vmlinux.sh to remove bash-specific extensions
related to recent work on exposing BTF info through sysfs, from Andrii.
3) AF_XDP zero copy fixes for i40e and ixgbe driver which caused umem
headroom to be added twice, from Ciara.
4) Refactoring work to convert sock opt tests into test_progs framework
in BPF kselftests, from Stanislav.
5) Fix a general protection fault in dev_map_hash_update_elem(), from Toke.
6) Cleanup to use BPF_PROG_RUN() macro in KCM, from Sami.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
osdmap has a bunch of arrays that grow linearly with the number of
OSDs. osd_state, osd_weight and osd_primary_affinity take 4 bytes per
OSD. osd_addr takes 136 bytes per OSD because of sockaddr_storage.
The CRUSH workspace area also grows linearly with the number of OSDs.
Normally these arrays are allocated at client startup. The osdmap is
usually updated in small incrementals, but once in a while a full map
may need to be processed. For a cluster with 10000 OSDs, this means
a bunch of 40K allocations followed by a 1.3M allocation, all of which
are currently required to be physically contiguous. This results in
sporadic ENOMEM errors, hanging the client.
Go back to manually (re)allocating arrays and use ceph_kvmalloc() to
fall back to non-contiguous allocation when necessary.
Link: https://tracker.ceph.com/issues/40481
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
The vmalloc allocator doesn't fully respect the specified gfp mask:
while the actual pages are allocated as requested, the page table pages
are always allocated with GFP_KERNEL. ceph_kvmalloc() may be called
with GFP_NOFS and GFP_NOIO (for ceph and rbd respectively), so this may
result in a deadlock.
There is no real reason for the current PAGE_ALLOC_COSTLY_ORDER logic,
it's just something that seemed sensible at the time (ceph_kvmalloc()
predates kvmalloc()). kvmalloc() is smarter: in an attempt to reduce
long term fragmentation, it first tries to kmalloc non-disruptively.
Switch to kvmalloc() and set the respective PF_MEMALLOC_* flag using
the scope API to avoid the deadlock. Note that kvmalloc() needs to be
passed GFP_KERNEL to enable the fallback.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
This bit was omitted from a561372405 ("libceph: fix PG split vs OSD
(re)connect race") to avoid backport conflicts.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
osd_req_op_cls_init() and osd_req_op_xattr_init() currently propagate
ceph_pagelist_alloc() ENOMEM errors but ignore ceph_pagelist_append()
memory allocation failures. Add these checks and cleanup on error.
Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This function also re-open connections to OSD/MON, and re-send in-flight
OSD requests after re-opening connections to OSD.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When filling in hardware intermediate representation tc_setup_flow_action()
directly obtains, checks and takes reference to dev used by mirred action,
instead of using act->ops->get_dev() API created specifically for this
purpose. In order to remove code duplication, refactor flow_action infra to
use action API when obtaining mirred action target dev. Extend get_dev()
with additional argument that is used to provide dev destructor to the
user.
Fixes: 5a6ff4b13d ("net: sched: take reference to action dev before calling offloads")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
With recent patch set that removed rtnl lock dependency from cls hardware
offload API rtnl lock is only taken when reading action data and can be
released after action-specific data is parsed into intermediate
representation. However, sample action psample group is passed by pointer
without obtaining reference to it first, which makes it possible to
concurrently overwrite the action and deallocate object pointed by
psample_group pointer after rtnl lock is released but before driver
finished using the pointer.
To prevent such race condition, obtain reference to psample group while it
is used by flow_action infra. Extend psample API with function
psample_group_take() that increments psample group reference counter.
Extend struct tc_action_ops with new get_psample_group() API. Implement the
API for action sample using psample_group_take() and already existing
psample_group_put() as a destructor. Use it in tc_setup_flow_action() to
take reference to psample group pointed to by entry->sample.psample_group
and release it in tc_cleanup_flow_action().
Disable bh when taking psample_groups_lock. The lock is now taken while
holding action tcf_lock that is used by data path and requires bh to be
disabled, so doing the same for psample_groups_lock is necessary to
preserve SOFTIRQ-irq-safety.
Fixes: 918190f50e ("net: sched: flower: don't take rtnl lock for cls hw offloads API")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Generalize flow_action_entry cleanup by extending the structure with
pointer to destructor function. Set the destructor in
tc_setup_flow_action(). Refactor tc_cleanup_flow_action() to call
entry->destructor() instead of using switch that dispatches by entry->id
and manually executes cleanup.
This refactoring is necessary for following patches in this series that
require destructor to use tc_action->ops callbacks that can't be easily
obtained in tc_cleanup_flow_action().
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In ip6erspan_tunnel_xmit(), if the skb will not be sent out, it has to
be freed on the tx_err path. Otherwise when deleting a netns, it would
cause dst/dev to leak, and dmesg shows:
unregister_netdevice: waiting for lo to become free. Usage count = 1
Fixes: ef7baf5e08 ("ip6_gre: add ip6 erspan collect_md mode")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP reuseport groups can hold a mix unconnected and connected sockets.
Ensure that connections only receive all traffic to their 4-tuple.
Fast reuseport returns on the first reuseport match on the assumption
that all matches are equal. Only if connections are present, return to
the previous behavior of scoring all sockets.
Record if connections are present and if so (1) treat such connected
sockets as an independent match from the group, (2) only return
2-tuple matches from reuseport and (3) do not return on the first
2-tuple reuseport match to allow for a higher scoring match later.
New field has_conns is set without locks. No other fields in the
bitmap are modified at runtime and the field is only ever set
unconditionally, so an RMW cannot miss a change.
Fixes: e32ea7e747 ("soreuseport: fast reuseport UDP socket selection")
Link: http://lkml.kernel.org/r/CA+FuTSfRP09aJNYRt04SS6qj22ViiOEWaWmLAwX0psk8-PGNxw@mail.gmail.com
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Craig Gallek <kraig@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All entries in 'rds_ib_stat_names' are stringified versions
of the corresponding "struct rds_ib_statistics" element
without the "s_"-prefix.
Fix entry 'ib_evt_handler_call' to do the same.
Fixes: f4f943c958 ("RDS: IB: ack more receive completions to improve performance")
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When tcf_block_get() fails in sfb_init(), q->qdisc is still a NULL
pointer which leads to a crash in sfb_destroy(). Similar for
sch_dsmark.
Instead of fixing each separately, Linus suggested to just accept
NULL pointer in qdisc_put(), which would make callers easier.
(For sch_dsmark, the bug probably exists long before commit
6529eaba33f0.)
Fixes: 6529eaba33 ("net: sched: introduce tcf block infractructure")
Reported-by: syzbot+d5870a903591faaca4ae@syzkaller.appspotmail.com
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The DSA core, DSA taggers and DSA drivers all make use of
module_init(). Hence they get initialised at device_initcall() time.
The ordering is non-deterministic. It can be a DSA driver is bound to
a device before the needed tag driver has been initialised, resulting
in the message:
No tagger for this switch
Rather than have this be fatal, return -EPROBE_DEFER so that it is
tried again later once all the needed drivers have been loaded.
Fixes: d3b8c04988 ("dsa: Add boilerplate helper to register DSA tag driver modules")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The test implemented by some_qdisc_is_busy() is somewhat loosy for
NOLOCK qdisc, as we may hit the following scenario:
CPU1 CPU2
// in net_tx_action()
clear_bit(__QDISC_STATE_SCHED...);
// in some_qdisc_is_busy()
val = (qdisc_is_running(q) ||
test_bit(__QDISC_STATE_SCHED,
&q->state));
// here val is 0 but...
qdisc_run(q)
// ... CPU1 is going to run the qdisc next
As a conseguence qdisc_run() in net_tx_action() can race with qdisc_reset()
in dev_qdisc_reset(). Such race is not possible for !NOLOCK qdisc as
both the above bit operations are under the root qdisc lock().
After commit 021a17ed79 ("pfifo_fast: drop unneeded additional lock on dequeue")
the race can cause use after free and/or null ptr dereference, but the root
cause is likely older.
This patch addresses the issue explicitly checking for deactivation under
the seqlock for NOLOCK qdisc, so that the qdisc_run() in the critical
scenario becomes a no-op.
Note that the enqueue() op can still execute concurrently with dev_qdisc_reset(),
but that is safe due to the skb_array() locking, and we can't avoid that
for NOLOCK qdiscs.
Fixes: 021a17ed79 ("pfifo_fast: drop unneeded additional lock on dequeue")
Reported-by: Li Shuang <shuali@redhat.com>
Reported-and-tested-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the fact that devlink reload failed is stored in drivers.
Move this flag into devlink core. Also, expose it to the user.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to properly implement failure indication during reload,
split the reload op into two ops, one for down phase and one for
up phase.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are more parentheses in if clause when call sctp_get_port_local
in sctp_do_bind, and redundant assignment to 'ret'. This patch is to
do cleanup.
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently sctp_get_port_local() returns a long
which is either 0,1 or a pointer casted to long.
It's neither of the callers use the return value since
commit 62208f1245 ("net: sctp: simplify sctp_get_port").
Now two callers are sctp_get_port and sctp_do_bind,
they actually assumend a casted to an int was the same as
a pointer casted to a long, and they don't save the return
value just check whether it is zero or non-zero, so
it would better change return type from long to int for
sctp_get_port_local.
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To resolve dependencies in following patches
mlx5_ib.h conflict resolved by keeing both hunks
Linux 5.3-rc8
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Enable setting skb->mark for UDP and RAW sockets using cmsg.
This is analogous to existing support for TOS, TTL, txtime, etc.
Packet sockets already support this as of commit c7d39e3263
("packet: support per-packet fwmark for af_packet sendmsg").
Similar to other fields, implement by
1. initialize the sockcm_cookie.mark from socket option sk_mark
2. optionally overwrite this in ip_cmsg_send/ip6_datagram_send_ctl
3. initialize inet_cork.mark from sockcm_cookie.mark
4. initialize each (usually just one) skb->mark from inet_cork.mark
Step 1 is handled in one location for most protocols by ipcm_init_sk
as of commit 351782067b ("ipv4: ipcm_cookie initializers").
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Fix error path of nf_tables_updobj(), from Dan Carpenter.
2) Move large structure away from stack in the nf_tables offload
infrastructure, from Arnd Bergmann.
3) Move indirect flow_block logic to nf_tables_offload.
4) Support for synproxy objects, from Fernando Fernandez Mancera.
5) Support for fwd and dup offload.
6) Add __nft_offload_get_chain() helper, this implicitly fixes missing
mutex and check for offload flags in the indirect block support,
patch from wenxu.
7) Remove rules on device unregistration, from wenxu. This includes
two preparation patches to reuse nft_flow_offload_chain() and
nft_flow_offload_rule().
Large batch from Jeremy Sowden to make a second pass to the
CONFIG_HEADER_TEST support and a bit of housekeeping:
8) Missing include guard in conntrack label header, from Jeremy Sowden.
9) A few coding style errors: trailing whitespace, incorrect indent in
Kconfig, and semicolons at the end of function definitions.
10) Remove unused ipt_init() and ip6t_init() declarations.
11) Inline xt_hashlimit, ebt_802_3 and xt_physdev headers. They are
only used once.
12) Update include directive in several netfilter files.
13) Remove unused include/net/netfilter/ipv6/nf_conntrack_icmpv6.h.
14) Move nf_ip6_ext_hdr() to include/linux/netfilter_ipv6.h
15) Move several synproxy structure definitions to nf_synproxy.h
16) Move nf_bridge_frag_data structure to include/linux/netfilter_bridge.h
17) Clean up static inline definitions in nf_conntrack_ecache.h.
18) Replace defined(CONFIG...) || defined(CONFIG...MODULE) with IS_ENABLED(CONFIG...).
19) Missing inline function conditional definitions based on Kconfig
preferences in synproxy and nf_conntrack_timeout.
20) Update br_nf_pre_routing_ipv6() definition.
21) Move conntrack code in linux/skbuff.h to nf_conntrack headers.
22) Several patches to remove superfluous CONFIG_NETFILTER and
CONFIG_NF_CONNTRACK checks in headers, coming from the initial batch
support for CONFIG_HEADER_TEST for netfilter.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Move some `struct nf_conntrack` code from linux/skbuff.h to
linux/nf_conntrack_common.h. Together with a couple of helpers for
getting and setting skb->_nfct, it allows us to remove
CONFIG_NF_CONNTRACK checks from net/netfilter/nf_conntrack.h.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There is a struct definition function in nf_conntrack_bridge.h which is
not specific to conntrack and is used elswhere in netfilter. Move it
into netfilter_bridge.h.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There is an inline function in ip6_tables.h which is not specific to
ip6tables and is used elswhere in netfilter. Move it into
netfilter_ipv6.h and update the callers.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nf_conntrack_icmpv6.h contains two object macros which duplicate macros
in linux/icmpv6.h. The latter definitions are also visible wherever it
is included, so remove it.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Include some headers in files which require them, and remove others
which are not required.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Three netfilter headers are only included once. Inline their contents
at those sites and remove them.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Several header-files, Kconfig files and Makefiles have trailing
white-space. Remove it.
In netfilter/Kconfig, indent the type of CONFIG_NETFILTER_NETLINK_ACCT
correctly.
There are semicolons at the end of two function definitions in
include/net/netfilter/nf_conntrack_acct.h and
include/net/netfilter/nf_conntrack_ecache.h. Remove them.
Fix indentation in nf_conntrack_l4proto.h.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If the net_device unregisters, clean up the offload rules before the
chain is destroy.
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pass rule, chain and flow_rule object parameters to nft_flow_offload_rule
to reuse it.
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pass chain and policy parameters to nft_flow_offload_chain to reuse it.
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add __nft_offload_get_chain function to get basechain from device. This
function requires that caller holds the per-netns nftables mutex. This
patch implicitly fixes missing offload flags check and proper mutex from
nft_indr_block_cb().
Fixes: 9a32669fec ("netfilter: nf_tables_offload: support indr block call")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The '.exit' functions from 'pernet_operations' structure should be marked
as __net_exit, not __net_init.
Fixes: 8e2d61e0ae ("sctp: fix race on protocol/netns initialization")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In qrtr_tun_write_iter the allocated kbuf should be release in case of
error or success return.
v2 Update: Thanks to David Miller for pointing out the release on success
path as well.
Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In event of failure during register_netdevice, free_netdev is
invoked immediately. free_netdev assumes that all the netdevice
refcounts have been dropped prior to it being called and as a
result frees and clears out the refcount pointer.
However, this is not necessarily true as some of the operations
in the NETDEV_UNREGISTER notifier handlers queue RCU callbacks for
invocation after a grace period. The IPv4 callback in_dev_rcu_put
tries to access the refcount after free_netdev is called which
leads to a null de-reference-
44837.761523: <6> Unable to handle kernel paging request at
virtual address 0000004a88287000
44837.761651: <2> pc : in_dev_finish_destroy+0x4c/0xc8
44837.761654: <2> lr : in_dev_finish_destroy+0x2c/0xc8
44837.762393: <2> Call trace:
44837.762398: <2> in_dev_finish_destroy+0x4c/0xc8
44837.762404: <2> in_dev_rcu_put+0x24/0x30
44837.762412: <2> rcu_nocb_kthread+0x43c/0x468
44837.762418: <2> kthread+0x118/0x128
44837.762424: <2> ret_from_fork+0x10/0x1c
Fix this by waiting for the completion of the call_rcu() in
case of register_netdevice errors.
Fixes: 93ee31f14f ("[NET]: Fix free_netdev on register_netdev failure.")
Cc: Sean Tranchetti <stranche@codeaurora.org>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the superfluous NET_DSA_TAG_KSZ_COMMON and just use the existing
NET_DSA_TAG_KSZ. Update the description to mention the three switch
families it supports. No functional change.
Signed-off-by: George McCollister <george.mccollister@gmail.com>
Reviewed-by: Marek Vasut <marex@denx.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The '.exit' functions from 'pernet_operations' structure should be marked
as __net_exit, not __net_init.
Fixes: d862e54614 ("net: ipv6: Implement /proc/net/icmp6.")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
When tcp sends a TSO packet, adding a PSH flag on it
reduces the sojourn time of GRO packet in GRO receivers.
This is particularly the case under pressure, since RX queues
receive packets for many concurrent flows.
A sender can give a hint to GRO engines when it is
appropriate to flush a super-packet, especially when pacing
is in the picture, since next packet is probably delayed by
one ms.
Having less packets in GRO engine reduces chance
of LRU eviction or inflated RTT, and reduces GRO cost.
We found recently that we must not set the PSH flag on
individual full-size MSS segments [1] :
Under pressure (CWR state), we better let the packet sit
for a small delay (depending on NAPI logic) so that the
ACK packet is delayed, and thus next packet we send is
also delayed a bit. Eventually the bottleneck queue can
be drained. DCTCP flows with CWND=1 have demonstrated
the issue.
This patch allows to slowdown the aggregate traffic without
involving high resolution timers on senders and/or
receivers.
It has been used at Google for about four years,
and has been discussed at various networking conferences.
[1] segments smaller than MSS already have PSH flag set
by tcp_sendmsg() / tcp_mark_push(), unless MSG_MORE
has been requested by the user.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix tcp_ecn_withdraw_cwr() to clear the correct bit:
TCP_ECN_QUEUE_CWR.
Rationale: basically, TCP_ECN_DEMAND_CWR is a bit that is purely about
the behavior of data receivers, and deciding whether to reflect
incoming IP ECN CE marks as outgoing TCP th->ece marks. The
TCP_ECN_QUEUE_CWR bit is purely about the behavior of data senders,
and deciding whether to send CWR. The tcp_ecn_withdraw_cwr() function
is only called from tcp_undo_cwnd_reduction() by data senders during
an undo, so it should zero the sender-side state,
TCP_ECN_QUEUE_CWR. It does not make sense to stop the reflection of
incoming CE bits on incoming data packets just because outgoing
packets were spuriously retransmitted.
The bug has been reproduced with packetdrill to manifest in a scenario
with RFC3168 ECN, with an incoming data packet with CE bit set and
carrying a TCP timestamp value that causes cwnd undo. Before this fix,
the IP CE bit was ignored and not reflected in the TCP ECE header bit,
and sender sent a TCP CWR ('W') bit on the next outgoing data packet,
even though the cwnd reduction had been undone. After this fix, the
sender properly reflects the CE bit and does not set the W bit.
Note: the bug actually predates 2005 git history; this Fixes footer is
chosen to be the oldest SHA1 I have tested (from Sep 2007) for which
the patch applies cleanly (since before this commit the code was in a
.h file).
Fixes: bdf1ee5d3b ("[TCP]: Move code from tcp_ecn.h to tcp*.c and tcp.h & remove it")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the equivalent of commit 2c6b55f45d ("ipv6: fix neighbour
resolution with raw socket") for ip6_confirm_neigh(): we can send a
packet with MSG_CONFIRM on a raw socket for a connected route, so the
gateway would be :: here, and we should pick the next hop using
rt6_nexthop() instead.
This was found by code review and, to the best of my knowledge, doesn't
actually fix a practical issue: the destination address from the packet
is not considered while confirming a neighbour, as ip6_confirm_neigh()
calls choose_neigh_daddr() without passing the packet, so there are no
similar issues as the one fixed by said commit.
A possible source of issues with the existing implementation might come
from the fact that, if we have a cached dst, we won't consider it,
while rt6_nexthop() takes care of that. I might just not be creative
enough to find a practical problem here: the only way to affect this
with cached routes is to have one coming from an ICMPv6 redirect, but
if the next hop is a directly connected host, there should be no
topology for which a redirect applies here, and tests with redirected
routes show no differences for MSG_CONFIRM (and MSG_PROBE) packets on
raw sockets destined to a directly connected host.
However, directly using the dst gateway here is not consistent anymore
with neighbour resolution, and, in general, as we want the next hop,
using rt6_nexthop() looks like the only sane way to fetch it.
Reported-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In rds_bind(), an rds_sock is added to the RDS bind hash table before
rs_transport is set. This means that the socket can be found by the
receive code path when rs_transport is NULL. And the receive code
path de-references rs_transport for congestion update check. This can
cause a panic. An rds_sock should not be added to the bind hash table
before all the needed fields are set.
Reported-by: syzbot+4b4f8163c2e246df3c4c@syzkaller.appspotmail.com
Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Layer 2 Update frame is used to update bridges when a station roams
to another AP even if that STA does not transmit any frames after the
reassociation. This behavior was described in IEEE Std 802.11F-2003 as
something that would happen based on MLME-ASSOCIATE.indication, i.e.,
before completing 4-way handshake. However, this IEEE trial-use
recommended practice document was published before RSN (IEEE Std
802.11i-2004) and as such, did not consider RSN use cases. Furthermore,
IEEE Std 802.11F-2003 was withdrawn in 2006 and as such, has not been
maintained amd should not be used anymore.
Sending out the Layer 2 Update frame immediately after association is
fine for open networks (and also when using SAE, FT protocol, or FILS
authentication when the station is actually authenticated by the time
association completes). However, it is not appropriate for cases where
RSN is used with PSK or EAP authentication since the station is actually
fully authenticated only once the 4-way handshake completes after
authentication and attackers might be able to use the unauthenticated
triggering of Layer 2 Update frame transmission to disrupt bridge
behavior.
Fix this by postponing transmission of the Layer 2 Update frame from
station entry addition to the point when the station entry is marked
authorized. Similarly, send out the VLAN binding update only if the STA
entry has already been authorized.
Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
* a fix in the new 6 GHz channel support
* a fix for recent minstrel (rate control) updates
for an infinite loop
* handle interface type changes better wrt. management frame
registrations (for management frames sent to userspace)
* add in-BSS RX time to survey information
* handle HW rfkill properly if !CONFIG_RFKILL
* send deauth on IBSS station expiry, to avoid state mismatches
* handle deferred crypto tailroom updates in mac80211 better
when device restart happens
* fix a spectre-v1 - really a continuation of a previous patch
* advertise NL80211_CMD_UPDATE_FT_IES as supported if so
* add some missing parsing in VHT extended NSS support
* support HE in mac80211_hwsim
* let mac80211 drivers determine the max MTU themselves
along with the usual cleanups etc.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl148l4ACgkQB8qZga/f
l8TafQ//Yjunnxq49ClKMDxyKwxIBvLRAMr4D3QwaWAcWUpar1/V0Ft/5glnHtbZ
7QptwbxVNUk/N68hYi8wlpMpvfzv/nShZD8QBNS1bk5E3Gng3yow03LhOx5iaYb2
KJXS1GE4jwkMD7Xn65+eMeb8rt1vEj8LleX91cguilq+y5YbcNFsP1nil2RtQyBU
cf5i8CBu4I5rTBoFaRvcz2xn+blqPSm2/piA+yXjzFp9vyVmhD+FjR5T482u48pj
wi/1zersGVUzBNElnZOKg67XPir1fcJqCfLILr7okPItWuYXHdVHGfn+arhimK4W
dyIpv1EfCe0lwKl4VTdhXt1GwhKvWCc2Ja7lz/RnDisGq9CYPJNfW7jFgR3tw4eg
SccnUhnxRIgD1V2KDcvsRadPo+2YsBJBmC61JRX1K6L3zpHLbktDyzxvTwCeVQ9A
TQp5bmQcKqDoq+/60JNHI6IxMbmX/vc2PC7dENGWUkem/JEmSWBB1wcL9gsRkhVi
c4uHyvFeXJaIV7YFA36hWCfa0fr+UUYxfzxudRSvxq/tTpayqBKu6fr3Pvv0SqTj
/BKkezoIdJntClhJv4PcDkZMva4uMCPtHCero9eICX5J+4AanD6caRNefX6L0PfF
DtN1sscVOy9zY2fV4tqvn3IASmOE5yB/dxjtKkMYyNHkZsGiDI0=
=HAuT
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-davem-2019-09-11' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
We have a number of changes, but things are settling down:
* a fix in the new 6 GHz channel support
* a fix for recent minstrel (rate control) updates
for an infinite loop
* handle interface type changes better wrt. management frame
registrations (for management frames sent to userspace)
* add in-BSS RX time to survey information
* handle HW rfkill properly if !CONFIG_RFKILL
* send deauth on IBSS station expiry, to avoid state mismatches
* handle deferred crypto tailroom updates in mac80211 better
when device restart happens
* fix a spectre-v1 - really a continuation of a previous patch
* advertise NL80211_CMD_UPDATE_FT_IES as supported if so
* add some missing parsing in VHT extended NSS support
* support HE in mac80211_hwsim
* let mac80211 drivers determine the max MTU themselves
along with the usual cleanups etc.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently frame registrations are not purged, even when changing the
interface type. This can lead to potentially weird situations where
frames possibly not allowed on a given interface type remain registered
due to the type switching happening after registration.
The kernel currently relies on userspace apps to actually purge the
registrations themselves, this is not something that the kernel should
rely on.
Add a call to cfg80211_mlme_purge_registrations() to forcefully remove
any registrations left over prior to switching the iftype.
Cc: stable@vger.kernel.org
Signed-off-by: Denis Kenzior <denkenz@gmail.com>
Link: https://lore.kernel.org/r/20190828211110.15005-1-denkenz@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
commit 1222a16014 ("nl80211: Fix possible Spectre-v1 for CQM
RSSI thresholds") was incomplete and requires one more fix to
prevent accessing to rssi_thresholds[n] because user can control
rssi_thresholds[i] values to make i reach to n. For example,
rssi_thresholds = {-400, -300, -200, -100} when last is -34.
Cc: stable@vger.kernel.org
Fixes: 1222a16014 ("nl80211: Fix possible Spectre-v1 for CQM RSSI thresholds")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Masashi Honma <masashi.honma@gmail.com>
Link: https://lore.kernel.org/r/20190908005653.17433-1-masashi.honma@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Make it possibly for drivers to adjust the default max_mtu
by storing it in the hardware struct and using that value
for all interfaces.
Signed-off-by: Wen Gong <wgong@codeaurora.org>
Link: https://lore.kernel.org/r/1567738137-31748-1-git-send-email-wgong@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
With the help of boolinit.cocci, we use !nl80211_reg_change_event_fill
instead of (nl80211_reg_change_event_fill == false). Meanwhile, Clean
up the code.
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Link: https://lore.kernel.org/r/1567657537-65472-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When we expire an inactive station, try to send it a deauth. This
helps if it's actually still around, and just has issues with
beacon distribution (or we do), and it will not also remove us.
Then, if we have shared state, this may not be reset properly,
causing problems; for example, we saw a case where aggregation
sessions weren't removed properly (due to the TX start being
offloaded to firmware and it relying on deauth for stop), causing
a lot of traffic to get lost due to the SN reset after remove/add
of the peer.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20190830112451.21655-9-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
We already assume that key is not NULL and dereference it in a few
other places before we check whether it is NULL, so the check is
unnecessary. Remove it.
Fixes: 96fc6efb9a ("mac80211: IEEE 802.11 Extended Key ID support")
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20190830112451.21655-8-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In case we got a fw restart while roaming from encrypted AP to
non-encrypted one, we might end up with hitting a warning on the pending
counter crypto_tx_tailroom_pending_dec having a non-zero value.
The following comment taken from net/mac80211/key.c explains the rational
for the delayed tailroom needed:
/*
* The reason for the delayed tailroom needed decrementing is to
* make roaming faster: during roaming, all keys are first deleted
* and then new keys are installed. The first new key causes the
* crypto_tx_tailroom_needed_cnt to go from 0 to 1, which invokes
* the cost of synchronize_net() (which can be slow). Avoid this
* by deferring the crypto_tx_tailroom_needed_cnt decrementing on
* key removal for a while, so if we roam the value is larger than
* zero and no 0->1 transition happens.
*
* The cost is that if the AP switching was from an AP with keys
* to one without, we still allocate tailroom while it would no
* longer be needed. However, in the typical (fast) roaming case
* within an ESS this usually won't happen.
*/
The next flow lead to the warning eventually reported as a bug:
1. Disconnect from encrypted AP
2. Set crypto_tx_tailroom_pending_dec = 1 for the key
3. Schedule work
4. Reconnect to non-encrypted AP
5. Add a new key, setting the tailroom counter = 1
6. Got FW restart while pending counter is set ---> hit the warning
While on it, the ieee80211_reset_crypto_tx_tailroom() func was merged into
its single caller ieee80211_reenable_keys (previously called
ieee80211_enable_keys). Also, we reset the crypto_tx_tailroom_pending_dec
and remove the counters warning as we just reset both.
Signed-off-by: Lior Cohen <lior2.cohen@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20190830112451.21655-7-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When we reach this point, the key cannot be NULL. Remove the condition
that suggests otherwise.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20190830112451.21655-6-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
"HE/HT/VHT" is a bit confusing since really the order of
development (and possible support) is different - change
this to "HT/VHT/HE".
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20190830112451.21655-4-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When the RFKILL subsystem isn't available, then rfkill_blocked()
always returns false. In the case of hardware rfkill this will
be wrong though, as if the hardware reported being killed then
it cannot operate any longer.
Since we only ever call the rfkill_sync work in this case, just
rename it to rfkill_block and always pass "true" for the blocked
parameter, rather than passing rfkill_blocked().
We rely on the underlying driver to still reject any new attempt
to bring up the device by itself.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20190830112451.21655-2-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This fixes was missed in parsing the vht capabilities max bw
support.
Signed-off-by: Mordechay Goodstein <mordechay.goodstein@intel.com>
Fixes: e80d642552 ("mac80211: copy VHT EXT NSS BW Support/Capable data to station")
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20190830114057.22197-1-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The boundary value used for the 6G band was incorrect as it would
result in invalid 6G channel number for certain frequencies.
Reported-by: Amar Singhal <asinghal@codeaurora.org>
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Link: https://lore.kernel.org/r/1567510772-24263-1-git-send-email-arend.vanspriel@broadcom.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This patch adds support for packet mirroring and redirection. The
nft_fwd_dup_netdev_offload() function configures the flow_action object
for the fwd and the dup actions.
Extend nft_flow_rule_destroy() to release the net_device object when the
flow_rule object is released, since nft_fwd_dup_netdev_offload() bumps
the net_device reference counter.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: wenxu <wenxu@ucloud.cn>
Register a new synproxy stateful object type into the stateful object
infrastructure.
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This issue causes SCTP_PEER_ADDR_THLDS sockopt not to be able to dump
a transport thresholds info.
Fix it by adding 'goto' put_user in sctp_getsockopt_paddr_thresholds.
Fixes: 8add543e36 ("sctp: add SCTP_FUTURE_ASSOC for SCTP_PEER_ADDR_THLDS sockopt")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of TCA_HHF_NON_HH_WEIGHT or TCA_HHF_QUANTUM is zero,
it would make no progress inside the loop in hhf_dequeue() thus
kernel would get stuck.
Fix this by checking this corner case in hhf_change().
Fixes: 10239edf86 ("net-qdisc-hhf: Heavy-Hitter Filter (HHF) qdisc")
Reported-by: syzbot+bc6297c11f19ee807dc2@syzkaller.appspotmail.com
Reported-by: syzbot+041483004a7f45f1f20a@syzkaller.appspotmail.com
Reported-by: syzbot+55be5f513bed37fc4367@syzkaller.appspotmail.com
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Terry Lam <vtlam@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
At least sch_red and sch_tbf don't implement ->tcf_block()
while still have a non-zero tc "class".
Instead of adding nop implementations to each of such qdisc's,
we can just relax the check of cops->tcf_block() in
tc_bind_tclass(). They don't support TC filter anyway.
Reported-by: syzbot+21b29db13c065852f64b@syzkaller.appspotmail.com
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the 'reset_dev_on_drv_probe' devlink parameter, controlling the
device reset policy on driver probe.
This parameter is useful in conjunction with the existing
'fw_load_policy' parameter.
Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NLM_F_MULTI must be used only when a NLMSG_DONE message is sent at the end.
In fact, NLMSG_DONE is sent only at the end of a dump.
Libraries like libnl will wait forever for NLMSG_DONE.
Fixes: 949f1e39a6 ("bridge: mdb: notify on router port add and del")
CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add nft_offload_init() and nft_offload_exit() function to deal with the
init and the exit path of the offload infrastructure.
Rename nft_indr_block_get_and_ing_cmd() to nft_indr_block_cb().
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The nft_offload_ctx structure is much too large to put on the
stack:
net/netfilter/nf_tables_offload.c:31:23: error: stack frame size of 1200 bytes in function 'nft_flow_rule_create' [-Werror,-Wframe-larger-than=]
Use dynamic allocation here, as we do elsewhere in the same
function.
Fixes: c9626a2cbd ("netfilter: nf_tables: add hardware offload support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The "newobj" is an error pointer so we can't pass it to kfree(). It
doesn't need to be freed so we can remove that and I also renamed the
error label.
Fixes: d62d0ba97b ("netfilter: nf_tables: Introduce stateful object update operation")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Unlike normal TCP code TLS has to touch the cache lines
it copies into to fill header info. On memory-heavy workloads
having non temporal stores and normal accesses targeting
the same cache line leads to significant overhead.
Measured 3% overhead running 3600 round robin connections
with additional memory heavy workload.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For TLS device offload the tag/message authentication code are
filled in by the device. The kernel merely reserves space for
them. Because device overwrites it, the contents of the tag make
do no matter. Current code tries to save space by reusing the
header as the tag. This, however, leads to an additional frag
being created and defeats buffer coalescing (which trickles
all the way down to the drivers).
Remove this optimization, and try to allocate the space for
the tag in the usual way, leave the memory uninitialized.
If memory allocation fails rewind the record pointer so that
we use the already copied user data as tag.
Note that the optimization was actually buggy, as the tag
for TLS 1.2 is 16 bytes, but header is just 13, so the reuse
may had looked past the end of the page..
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All modifications to TLS record list happen under the socket
lock. Since records form an ordered queue readers are only
concerned about elements being removed, additions can happen
concurrently.
Use RCU primitives to ensure the correct access types
(READ_ONCE/WRITE_ONCE).
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's generally more cache friendly to walk arrays in order,
especially those which are likely not in cache.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth-next 2019-09-06
Here's the main bluetooth-next pull request for the 5.4 kernel.
- Cleanups & fixes to btrtl driver
- Fixes for Realtek devices in btusb, e.g. for suspend handling
- Firmware loading support for BCM4345C5
- hidp_send_message() return value handling fixes
- Added support for utilizing Fast Advertising Interval
- Various other minor cleanups & fixes
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Historically, support for frag_list packets entering skb_segment() was
limited to frag_list members terminating on exact same gso_size
boundaries. This is verified with a BUG_ON since commit 89319d3801
("net: Add frag_list support to skb_segment"), quote:
As such we require all frag_list members terminate on exact MSS
boundaries. This is checked using BUG_ON.
As there should only be one producer in the kernel of such packets,
namely GRO, this requirement should not be difficult to maintain.
However, since commit 6578171a7f ("bpf: add bpf_skb_change_proto helper"),
the "exact MSS boundaries" assumption no longer holds:
An eBPF program using bpf_skb_change_proto() DOES modify 'gso_size', but
leaves the frag_list members as originally merged by GRO with the
original 'gso_size'. Example of such programs are bpf-based NAT46 or
NAT64.
This lead to a kernel BUG_ON for flows involving:
- GRO generating a frag_list skb
- bpf program performing bpf_skb_change_proto() or bpf_skb_adjust_room()
- skb_segment() of the skb
See example BUG_ON reports in [0].
In commit 13acc94eff ("net: permit skb_segment on head_frag frag_list skb"),
skb_segment() was modified to support the "gso_size mangling" case of
a frag_list GRO'ed skb, but *only* for frag_list members having
head_frag==true (having a page-fragment head).
Alas, GRO packets having frag_list members with a linear kmalloced head
(head_frag==false) still hit the BUG_ON.
This commit adds support to skb_segment() for a 'head_skb' packet having
a frag_list whose members are *non* head_frag, with gso_size mangled, by
disabling SG and thus falling-back to copying the data from the given
'head_skb' into the generated segmented skbs - as suggested by Willem de
Bruijn [1].
Since this approach involves the penalty of skb_copy_and_csum_bits()
when building the segments, care was taken in order to enable this
solution only when required:
- untrusted gso_size, by testing SKB_GSO_DODGY is set
(SKB_GSO_DODGY is set by any gso_size mangling functions in
net/core/filter.c)
- the frag_list is non empty, its item is a non head_frag, *and* the
headlen of the given 'head_skb' does not match the gso_size.
[0]
https://lore.kernel.org/netdev/20190826170724.25ff616f@pixies/https://lore.kernel.org/netdev/9265b93f-253d-6b8c-f2b8-4b54eff1835c@fb.com/
[1]
https://lore.kernel.org/netdev/CA+FuTSfVsgNDi7c=GUU8nMg2hWxF2SjCNLXetHeVPdnxAW5K-w@mail.gmail.com/
Fixes: 6578171a7f ("bpf: add bpf_skb_change_proto helper")
Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a re-post of previous patch wrote by David Miller[1].
Phil Karn reported[2] that on busy networks with lots of unresolved
multicast routing entries, the creation of new multicast group routes
can be extremely slow and unreliable.
The reason is we hard-coded multicast route entries with unresolved source
addresses(cache_resolve_queue_len) to 10. If some multicast route never
resolves and the unresolved source addresses increased, there will
be no ability to create new multicast route cache.
To resolve this issue, we need either add a sysctl entry to make the
cache_resolve_queue_len configurable, or just remove cache_resolve_queue_len
limit directly, as we already have the socket receive queue limits of mrouted
socket, pointed by David.
>From my side, I'd perfer to remove the cache_resolve_queue_len limit instead
of creating two more(IPv4 and IPv6 version) sysctl entry.
[1] https://lkml.org/lkml/2018/7/22/11
[2] https://lkml.org/lkml/2018/7/21/343
v3: instead of remove cache_resolve_queue_len totally, let's only remove
the hard code limit when allocate the unresolved cache, as Eric Dumazet
suggested, so we don't need to re-count it in other places.
v2: hold the mfc_unres_lock while walking the unresolved list in
queue_count(), as Nikolay Aleksandrov remind.
Reported-by: Phil Karn <karn@ka9q.net>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes a stupid bug I recently introduced...
ip6_route_info_create() returns an ERR_PTR(err) and not a NULL on error.
Fixes: d55a2e374a ("net-ipv6: fix excessive RTF_ADDRCONF flag on ::1/128 local route (and others)'")
Cc: David Ahern <dsahern@gmail.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_diag_get_aux_size() can be called with sockets in any state.
icsk_ulp_ops is only present for full sockets.
For SYN_RECV or TIME_WAIT ones we would access garbage.
Fixes: 61723b3932 ("tcp: ulp: add functions to dump ulp-specific information")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Luke Hsiao <lukehsiao@google.com>
Reported-by: Neal Cardwell <ncardwell@google.com>
Cc: Davide Caratti <dcaratti@redhat.com>
Acked-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No need for fib_notifier_ops to be in struct net. It is used only by
fib_notifier as a private data. Use net_generic to introduce per-net
fib_notifier struct and move fib_notifier_ops there.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Add nft_reg_store64() and nft_reg_load64() helpers, from Ander Juaristi.
2) Time matching support, also from Ander Juaristi.
3) VLAN support for nfnetlink_log, from Michael Braun.
4) Support for set element deletions from the packet path, also from Ander.
5) Remove __read_mostly from conntrack spinlock, from Li RongQing.
6) Support for updating stateful objects, this also includes the initial
client for this infrastructure: the quota extension. A follow up fix
for the control plane also comes in this batch. Patches from
Fernando Fernandez Mancera.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of invoking struct bpf_prog::bpf_func directly, use the
BPF_PROG_RUN macro.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann says:
====================
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add the ability to use unaligned chunks in the AF_XDP umem. By
relaxing where the chunks can be placed, it allows to use an
arbitrary buffer size and place whenever there is a free
address in the umem. Helps more seamless DPDK AF_XDP driver
integration. Support for i40e, ixgbe and mlx5e, from Kevin and
Maxim.
2) Addition of a wakeup flag for AF_XDP tx and fill rings so the
application can wake up the kernel for rx/tx processing which
avoids busy-spinning of the latter, useful when app and driver
is located on the same core. Support for i40e, ixgbe and mlx5e,
from Magnus and Maxim.
3) bpftool fixes for printf()-like functions so compiler can actually
enforce checks, bpftool build system improvements for custom output
directories, and addition of 'bpftool map freeze' command, from Quentin.
4) Support attaching/detaching XDP programs from 'bpftool net' command,
from Daniel.
5) Automatic xskmap cleanup when AF_XDP socket is released, and several
barrier/{read,write}_once fixes in AF_XDP code, from Björn.
6) Relicense of bpf_helpers.h/bpf_endian.h for future libbpf
inclusion as well as libbpf versioning improvements, from Andrii.
7) Several new BPF kselftests for verifier precision tracking, from Alexei.
8) Several BPF kselftest fixes wrt endianess to run on s390x, from Ilya.
9) And more BPF kselftest improvements all over the place, from Stanislav.
10) Add simple BPF map op cache for nfp driver to batch dumps, from Jakub.
11) AF_XDP socket umem mapping improvements for 32bit archs, from Ivan.
12) Add BPF-to-BPF call and BTF line info support for s390x JIT, from Yauheni.
13) Small optimization in arm64 JIT to spare 1 insns for BPF_MOD, from Jerin.
14) Fix an error check in bpf_tcp_gen_syncookie() helper, from Petar.
15) Various minor fixes and cleanups, from Nathan, Masahiro, Masanari,
Peter, Wei, Yue.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
hidp_send_message was changed to return non-zero values on success,
which some other bits did not expect. This caused spurious errors to be
propagated through the stack, breaking some drivers, such as hid-sony
for the Dualshock 4 in Bluetooth mode.
As pointed out by Dan Carpenter, hid-microsoft directly relied on that
assumption as well.
Fixes: 48d9cc9d85 ("Bluetooth: hidp: Let hidp_send_message return number of queued bytes")
Signed-off-by: Dan Elkouby <streetwalkermc@gmail.com>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Whenever MQ is not used on a multiqueue device, we experience
serious reordering problems. Bisection found the cited
commit.
The issue can be described this way :
- A single qdisc hierarchy is shared by all transmit queues.
(eg : tc qdisc replace dev eth0 root fq_codel)
- When/if try_bulk_dequeue_skb_slow() dequeues a packet targetting
a different transmit queue than the one used to build a packet train,
we stop building the current list and save the 'bad' skb (P1) in a
special queue. (bad_txq)
- When dequeue_skb() calls qdisc_dequeue_skb_bad_txq() and finds this
skb (P1), it checks if the associated transmit queues is still in frozen
state. If the queue is still blocked (by BQL or NIC tx ring full),
we leave the skb in bad_txq and return NULL.
- dequeue_skb() calls q->dequeue() to get another packet (P2)
The other packet can target the problematic queue (that we found
in frozen state for the bad_txq packet), but another cpu just ran
TX completion and made room in the txq that is now ready to accept
new packets.
- Packet P2 is sent while P1 is still held in bad_txq, P1 might be sent
at next round. In practice P2 is the lead of a big packet train
(P2,P3,P4 ...) filling the BQL budget and delaying P1 by many packets :/
To solve this problem, we have to block the dequeue process as long
as the first packet in bad_txq can not be sent. Reordering issues
disappear and no side effects have been seen.
Fixes: a53851e2c3 ("net: sched: explicit locking in gso_cpu fallback")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steffen Klassert says:
====================
pull request (net): ipsec 2019-09-05
1) Several xfrm interface fixes from Nicolas Dichtel:
- Avoid an interface ID corruption on changelink.
- Fix wrong intterface names in the logs.
- Fix a list corruption when changing network namespaces.
- Fix unregistation of the underying phydev.
2) Fix a potential warning when merging xfrm_plocy nodes.
From Florian Westphal.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
For high speed adapter like Mellanox CX-5 card, it can reach upto
100 Gbits per second bandwidth. Currently htb already supports 64bit rate
in tc utility. However police action rate and peakrate are still limited
to 32bit value (upto 32 Gbits per second). Add 2 new attributes
TCA_POLICE_RATE64 and TCA_POLICE_RATE64 in kernel for 64bit support
so that tc utility can use them for 64bit rate and peakrate value to
break the 32bit limit, and still keep the backward binary compatibility.
Tested-by: David Dai <zdai@linux.vnet.ibm.com>
Signed-off-by: David Dai <zdai@linux.vnet.ibm.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Offloaded OvS datapath rules are translated one to one to tc rules,
for example the following simplified OvS rule:
recirc_id(0),in_port(dev1),eth_type(0x0800),ct_state(-trk) actions:ct(),recirc(2)
Will be translated to the following tc rule:
$ tc filter add dev dev1 ingress \
prio 1 chain 0 proto ip \
flower tcp ct_state -trk \
action ct pipe \
action goto chain 2
Received packets will first travel though tc, and if they aren't stolen
by it, like in the above rule, they will continue to OvS datapath.
Since we already did some actions (action ct in this case) which might
modify the packets, and updated action stats, we would like to continue
the proccessing with the correct recirc_id in OvS (here recirc_id(2))
where we left off.
To support this, introduce a new skb extension for tc, which
will be used for translating tc chain to ovs recirc_id to
handle these miss cases. Last tc chain index will be set
by tc goto chain action and read by OvS datapath.
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct mgmt_rp_get_connections {
...
struct mgmt_addr_info addr[0];
} __packed;
Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes.
So, replace the following form:
sizeof(*rp) + (i * sizeof(struct mgmt_addr_info));
with:
struct_size(rp, addr, i)
Also, notice that, in this case, variable rp_len is not necessary,
hence it is removed.
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Static variable header_ops, of type header_ops, is used only once, when
it is assigned to field header_ops of a variable having type net_device.
This corresponding field is declared as const in the definition of
net_device. Hence make header_ops constant as well to protect it from
unnecessary modification.
Issue found with Coccinelle.
Signed-off-by: Nishka Dasgupta <nishkadg.linux@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Changes made to add support for fast advertising interval
as per core 4.1 specification, section 9.3.11.2.
A peripheral device entering any of the following GAP modes and
sending either non-connectable advertising events or scannable
undirected advertising events should use adv_fast_interval2
(100ms - 150ms) for adv_fast_period(30s).
- Non-Discoverable Mode
- Non-Connectable Mode
- Limited Discoverable Mode
- General Discoverable Mode
Signed-off-by: Spoorthi Ravishankar Koppad <spoorthix.k@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
When accessing the members of an XDP socket, the control mutex should
be held. This commit fixes that.
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Fixes: a36b38aa2a ("xsk: add sock_diag interface for AF_XDP")
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Prior the state variable was introduced by Ilya, the dev member was
used to determine whether the socket was bound or not. However, when
dev was read, proper SMP barriers and READ_ONCE were missing. In order
to address the missing barriers and READ_ONCE, we start using the
state variable as a point of synchronization. The state member
read/write is paired with proper SMP barriers, and from this follows
that the members described above does not need READ_ONCE if used in
conjunction with state check.
In all syscalls and the xsk_rcv path we check if state is
XSK_BOUND. If that is the case we do a SMP read barrier, and this
implies that the dev, umem and all rings are correctly setup. Note
that no READ_ONCE are needed for these variable if used when state is
XSK_BOUND (plus the read barrier).
To summarize: The members struct xdp_sock members dev, queue_id, umem,
fq, cq, tx, rx, and state were read lock-less, with incorrect barriers
and missing {READ, WRITE}_ONCE. Now, umem, fq, cq, tx, rx, and state
are read lock-less. When these members are updated, WRITE_ONCE is
used. When read, READ_ONCE are only used when read outside the control
mutex (e.g. mmap) or, not synchronized with the state member
(XSK_BOUND plus smp_rmb())
Note that dev and queue_id do not need a WRITE_ONCE or READ_ONCE, due
to the introduce state synchronization (XSK_BOUND plus smp_rmb()).
Introducing the state check also fixes a race, found by syzcaller, in
xsk_poll() where umem could be accessed when stale.
Suggested-by: Hillf Danton <hdanton@sina.com>
Reported-by: syzbot+c82697e3043781e08802@syzkaller.appspotmail.com
Fixes: 77cd0d7b3f ("xsk: add support for need_wakeup flag in AF_XDP rings")
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The umem member of struct xdp_sock is read outside of the control
mutex, in the mmap implementation, and needs a WRITE_ONCE to avoid
potential store-tearing.
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Fixes: 423f38329d ("xsk: add umem fill queue support and mmap")
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Use WRITE_ONCE when doing the store of tx, rx, fq, and cq, to avoid
potential store-tearing. These members are read outside of the control
mutex in the mmap implementation.
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Fixes: 37b076933a ("xsk: add missing write- and data-dependency barrier")
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Not all objects have an update operation. If the object type doesn't
implement an update operation and the user tries to update it will hit
EOPNOTSUPP.
Fixes: d62d0ba97b ("netfilter: nf_tables: Introduce stateful object update operation")
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When creating a v4 route that uses a v6 nexthop from a nexthop group.
Allow the kernel to properly send the nexthop as v6 via the RTA_VIA
attribute.
Broken behavior:
$ ip nexthop add via fe80::9 dev eth0
$ ip nexthop show
id 1 via fe80::9 dev eth0 scope link
$ ip route add 4.5.6.7/32 nhid 1
$ ip route show
default via 10.0.2.2 dev eth0
4.5.6.7 nhid 1 via 254.128.0.0 dev eth0
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15
$
Fixed behavior:
$ ip nexthop add via fe80::9 dev eth0
$ ip nexthop show
id 1 via fe80::9 dev eth0 scope link
$ ip route add 4.5.6.7/32 nhid 1
$ ip route show
default via 10.0.2.2 dev eth0
4.5.6.7 nhid 1 via inet6 fe80::9 dev eth0
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15
$
v2, v3: Addresses code review comments from David Ahern
Fixes: dcb1ecb50e (“ipv4: Prepare for fib6_nh from a nexthop object”)
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use %*ph format to print small buffer as hex string.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEmvEkXzgOfc881GuFWsYho5HknSAFAl1vrJITHG1rbEBwZW5n
dXRyb25peC5kZQAKCRBaxiGjkeSdIC8BB/98XcWiaInD+SM6UjD2dVd1r0zhPKJS
WBK58G81+op3YP4DY8Iy+C24uZBlSlutVGoD/PIrZF39xXsnOtJuMVHC4LvtdADC
30uI/61JQNEjuX2AiTFudqDvYjZZKZ28HLqEnO2pWk3dMVL3+fkS3i7VQR7KJ/Gr
BYM6EzCdkbuWW/zsAVbKLJ8NswVmcdjP7eSK+exKppoWMtgCglZw1X6iP5YXDnbK
h3dGs687u8RfUra7j7vgnJzyQU4draMPsabaLDT5qw1PgYQ3k8MTVMBlULR0+HHO
qkBqumRwfOxay0z0XOgRuWrICKTH/b0SRLp3H53ZyfDo6+4TC9KGHRgX
=gwfZ
-----END PGP SIGNATURE-----
Merge tag 'linux-can-next-for-5.4-20190904' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
pull-request: can-next 2019-09-04 j1939
this is a pull request for net-next/master consisting of 21 patches.
the first 12 patches are by me and target the CAN core infrastructure.
They clean up the names of variables , structs and struct members,
convert can_rx_register() to use max() instead of open coding it and
remove unneeded code from the can_pernet_exit() callback.
The next three patches are also by me and they introduce and make use of
the CAN midlayer private structure. It is used to hold protocol specific
per device data structures.
The next patch is by Oleksij Rempel, switches the
&net->can.rcvlists_lock from a spin_lock() to a spin_lock_bh(), so that
it can be used from NAPI (soft IRQ) context.
The next 4 patches are by Kurt Van Dijck, he first updates his email
address via mailmap and then extends sockaddr_can to include j1939
members.
The final patch is the collective effort of many entities (The j1939
authors: Oliver Hartkopp, Bastian Stender, Elenita Hinds, kbuild test
robot, Kurt Van Dijck, Maxime Jayat, Robin van der Gracht, Oleksij
Rempel, Marc Kleine-Budde). It adds support of SAE J1939 protocol to the
CAN networking stack.
SAE J1939 is the vehicle bus recommended practice used for communication
and diagnostics among vehicle components. Originating in the car and
heavy-duty truck industry in the United States, it is now widely used in
other parts of the world.
P.S.: This pull request doesn't invalidate my last pull request:
"pull-request: can-next 2019-09-03".
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
A change to the core nla helpers was missed during the push of
the nexthop changes. rt6_fill_node_nexthop should be calling
nla_nest_start_noflag not nla_nest_start. Currently, iproute2
does not print multipath data because of parsing issues with
the attribute.
Fixes: f88d8ea67f ("ipv6: Plumb support for nexthop object in a fib6_info")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock_map and ULP only work together when ULP is loaded after the sock
map is loaded. In the sock_map case we added a check for this to fail
the load if ULP is already set. However, we missed the check on the
sock_hash side.
Add a ULP check to the sock_hash update path.
Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Reported-by: syzbot+7a6ee4d0078eac6bf782@syzkaller.appspotmail.com
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The comment we have is just repeating what the code does.
Include the *reason* for the condition instead.
Cc: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If retransmit record hint fall into the cleanup window we will
free it by just walking the list. No need to duplicate the code.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TLS code has a number of #ifdefs which make the code a little
harder to follow. Recent fixes removed the ifdef around the
TLS_HW define, so we can switch to the often used pattern
of defining tls_device functions as empty static inlines
in the header when CONFIG_TLS_DEVICE=n.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On setsockopt path we need to hold device_offload_lock from
the moment we check netdev is up until the context is fully
ready to be added to the tls_device_list.
No need to hold it around the get_netdev_for_sock().
Change the code and remove the confusing comment.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reusing parts of error path for normal exit will make
next commit harder to read, untangle the two.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we already have the pointer to the full original sk_proto
stored use that instead of storing all individual callback
pointers as well.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IN_MULTICAST's primary intent is as a uapi macro.
Elsewhere in the kernel we use ipv4_is_multicast consistently.
This patch unifies linux's multicast checks to use that function
rather than this macro.
Signed-off-by: Dave Taht <dave.taht@gmail.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Variable port_rate is being initialized with a value that is never read
and is being re-assigned a little later on. The assignment is redundant
and hence can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth 2019-09-05
Here are a few more Bluetooth fixes for 5.3. I hope they can still make
it. There's one USB ID addition for btusb, two reverts due to discovered
regressions, and two other important fixes.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit c49a8682fc.
There are devices which require low connection intervals for usable operation
including keyboards and mice. Forcing a static connection interval for
these types of devices has an impact in latency and causes a regression.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
There is a subtle change in behaviour introduced by:
commit c7a1ce397a
'ipv6: Change addrconf_f6i_alloc to use ip6_route_info_create'
Before that patch /proc/net/ipv6_route includes:
00000000000000000000000000000001 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000003 00000000 80200001 lo
Afterwards /proc/net/ipv6_route includes:
00000000000000000000000000000001 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000002 00000000 80240001 lo
ie. the above commit causes the ::1/128 local (automatic) route to be flagged with RTF_ADDRCONF (0x040000).
AFAICT, this is incorrect since these routes are *not* coming from RA's.
As such, this patch restores the old behaviour.
Fixes: c7a1ce397a ("ipv6: Change addrconf_f6i_alloc to use ip6_route_info_create")
Cc: David Ahern <dsahern@gmail.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Transport should use its own pf_retrans to do the error_count
check, instead of asoc's. Otherwise, it's meaningless to make
pf_retrans per transport.
Fixes: 5aa93bcf66 ("sctp: Implement quick failover draft from tsvwg")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's a misplaced traceline in rxrpc_input_packet() which is looking at a
packet that just got released rather than the replacement packet.
Fix this by moving the traceline after the assignment that moves the new
packet pointer to the actual packet pointer.
Fixes: d0d5c0cd1e ("rxrpc: Use skb_unshare() rather than skb_cow_data()")
Reported-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SAE J1939 is the vehicle bus recommended practice used for communication
and diagnostics among vehicle components. Originating in the car and
heavy-duty truck industry in the United States, it is now widely used in
other parts of the world.
J1939, ISO 11783 and NMEA 2000 all share the same high level protocol.
SAE J1939 can be considered the replacement for the older SAE J1708 and
SAE J1587 specifications.
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Bastian Stender <bst@pengutronix.de>
Signed-off-by: Elenita Hinds <ecathinds@gmail.com>
Signed-off-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Kurt Van Dijck <dev.kurt@vandijck-laurijssen.be>
Signed-off-by: Maxime Jayat <maxime.jayat@mobile-devices.fr>
Signed-off-by: Robin van der Gracht <robin@protonic.nl>
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
The size of this structure will be increased with J1939 support. To stay
binary compatible, the CAN_REQUIRED_SIZE macro is introduced for
existing CAN protocols.
Signed-off-by: Kurt Van Dijck <dev.kurt@vandijck-laurijssen.be>
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
The can_rx_unregister() can be called from NAPI (soft IRQ) context, at least
by j1939 stack. This leads to potential dead lock with &net->can.rcvlists_lock
called from can_rx_register:
===============================================================================
WARNING: inconsistent lock state
4.19.0-20181029-1-g3e67f95ba0d3 #3 Not tainted
--------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
testj1939/224 [HC0[0]:SC1[1]:HE1:SE0] takes:
1ad0fda3 (&(&net->can.rcvlists_lock)->rlock){+.?.}, at: can_rx_unregister+0x4c/0x1ac
{SOFTIRQ-ON-W} state was registered at:
lock_acquire+0xd0/0x1f4
_raw_spin_lock+0x30/0x40
can_rx_register+0x5c/0x14c
j1939_netdev_start+0xdc/0x1f8
j1939_sk_bind+0x18c/0x1c8
__sys_bind+0x70/0xb0
sys_bind+0x10/0x14
ret_fast_syscall+0x0/0x28
0xbedc9b64
irq event stamp: 2440
hardirqs last enabled at (2440): [<c01302c0>] __local_bh_enable_ip+0xac/0x184
hardirqs last disabled at (2439): [<c0130274>] __local_bh_enable_ip+0x60/0x184
softirqs last enabled at (2412): [<c08b0bf4>] release_sock+0x84/0xa4
softirqs last disabled at (2415): [<c013055c>] irq_exit+0x100/0x1b0
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&(&net->can.rcvlists_lock)->rlock);
<Interrupt>
lock(&(&net->can.rcvlists_lock)->rlock);
*** DEADLOCK ***
2 locks held by testj1939/224:
#0: 168eb13b (rcu_read_lock){....}, at: netif_receive_skb_internal+0x3c/0x350
#1: 168eb13b (rcu_read_lock){....}, at: can_receive+0x88/0x1c0
===============================================================================
To avoid this situation, we should use spin_lock_bh() instead of spin_lock().
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Since using the "struct can_ml_priv" for the per device "struct
dev_rcv_lists" the call can_dev_rcv_lists_find() cannot fail anymore.
This patch simplifies af_can by removing the NULL pointer checks from
the dev_rcv_lists returned by can_dev_rcv_lists_find().
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch removes the old method of allocating the per device protocol
specific memory via a netdevice_notifier. This had the drawback, that
the allocation can fail, leading to a lot of null pointer checks in the
code. This also makes the live cycle management of this memory quite
complicated.
This patch switches from the allocating the struct can_dev_rcv_lists in
a NETDEV_REGISTER call to using the dev->ml_priv, which is allocated by
the driver since the previous patch.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch introduces the CAN midlayer private structure ("struct
can_ml_priv") which should be used to hold protocol specific per device
data structures. For now it's only member is "struct can_dev_rcv_lists".
The CAN midlayer private is allocated via alloc_netdev()'s private and
assigned to "struct net_device::ml_priv" during device creation. This is
done transparently for CAN drivers using alloc_candev(). The slcan, vcan
and vxcan drivers which are not using alloc_candev() have been adopted
manually. The memory layout of the netdev_priv allocated via
alloc_candev() will looke like this:
+-------------------------+
| driver's priv |
+-------------------------+
| struct can_ml_priv |
+-------------------------+
| array of struct sk_buff |
+-------------------------+
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
The networking core takes care and unregisters every network device in
a namespace before calling the can_pernet_exit() hook. This patch
removes the unneeded cleanup.
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Suggested-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch replaces an open coded max by the proper kernel define max().
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch gives the variables holding the CAN receiver and the receiver
list a better name by renaming them from "r to "rcv" and "rl" to
"recv_list".
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch add the commonly used prefix "can_" to the find_dev_rcv_lists()
function and moves the "find" to the end, as the function returns a struct
can_dev_rcv_list. This improves the overall readability of the code.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch add the commonly used prefix "can_" to the find_rcv_list()
function and add the "find" to the end, as the function returns a struct
rcv_list. This improves the overall readability of the code.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch gives the variables holding the CAN per device receive filter lists
a better name by renaming them from "d" to "dev_rcv_lists".
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch gives the variables holding the CAN receive filter lists a
better name by renaming them from "d" to "dev_rcv_lists".
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch improves the code reability by removing the redundant "can_"
prefix from the members of struct netns_can (as the struct netns_can itself
is the member "can" of the struct net.)
The conversion is done with:
sed -i \
-e "s/struct can_dev_rcv_lists \*can_rx_alldev_list;/struct can_dev_rcv_lists *rx_alldev_list;/" \
-e "s/spinlock_t can_rcvlists_lock;/spinlock_t rcvlists_lock;/" \
-e "s/struct timer_list can_stattimer;/struct timer_list stattimer; /" \
-e "s/can\.can_rx_alldev_list/can.rx_alldev_list/g" \
-e "s/can\.can_rcvlists_lock/can.rcvlists_lock/g" \
-e "s/can\.can_stattimer/can.stattimer/g" \
include/net/netns/can.h \
net/can/*.[ch]
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch rename the variables holding the CAN statistics (can_stats
and can_pstats) to pkg_stats and rcv_lists_stats which reflect better
their meaning.
The conversion is done with:
sed -i \
-e "s/can_stats\([^_]\)/pkg_stats\1/g" \
-e "s/can_pstats/rcv_lists_stats/g" \
net/can/proc.c
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch rename the variables holding the CAN statistics (can_stats
and can_pstats) to pkg_stats and rcv_lists_stats which reflect better
their meaning.
The conversion is done with:
sed -i \
-e "s/can_stats\([^_]\)/pkg_stats\1/g" \
-e "s/can_pstats/rcv_lists_stats/g" \
net/can/af_can.c
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch gives the members of the struct netns_can that are holding
the statistics a sensible name, by renaming struct netns_can::can_stats
into struct netns_can::pkg_stats and struct netns_can::can_pstats into
struct netns_can::rcv_lists_stats.
The conversion is done with:
sed -i \
-e "s:\(struct[^*]*\*\)can_stats;.*:\1pkg_stats;:" \
-e "s:\(struct[^*]*\*\)can_pstats;.*:\1rcv_lists_stats;:" \
-e "s/can\.can_stats/can.pkg_stats/g" \
-e "s/can\.can_pstats/can.rcv_lists_stats/g" \
net/can/*.[ch] \
include/net/netns/can.h
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch renames both "struct s_stats" and "struct s_pstats", to
"struct can_pkg_stats" and "struct can_rcv_lists_stats" to better
reflect their meaning and improve code readability.
The conversion is done with:
sed -i \
-e "s/struct s_stats/struct can_pkg_stats/g" \
-e "s/struct s_pstats/struct can_rcv_lists_stats/g" \
net/can/*.[ch] \
include/net/netns/can.h
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Set up the default timeout for this new entry otherwise the garbage
collector might quickly remove it right after the flowtable insertion.
Fixes: ac2a66665e ("netfilter: add generic flow table infrastructure")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If this flag is set, timeout and state are irrelevant to userspace.
Fixes: 90964016e5 ("netfilter: nf_conntrack: add IPS_OFFLOAD status bit")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If IPv6 is disabled on boot (ipv6.disable=1), but nft_fib_inet ends up
dealing with a IPv6 packet, it causes a kernel panic in
fib6_node_lookup_1(), crashing in bad_page_fault.
The panic is caused by trying to deference a very low address (0x38
in ppc64le), due to ipv6.fib6_main_tbl = NULL.
BUG: Kernel NULL pointer dereference at 0x00000038
The kernel panic was reproduced in a host that disabled IPv6 on boot and
have to process guest packets (coming from a bridge) using it's ip6tables.
Terminate rule evaluation when packet protocol is IPv6 but the ipv6 module
is not loaded.
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds the infrastructure needed for the stateful object update
support.
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The p9_tag_alloc() does not initialize the transport error t_err field.
The struct p9_req_t *req is allocated and stored in a struct p9_client
variable. The field t_err is never initialized before p9_conn_cancel()
checks its value.
KUMSAN(KernelUninitializedMemorySantizer, a new error detection tool)
reports this bug.
==================================================================
BUG: KUMSAN: use of uninitialized memory in p9_conn_cancel+0x2d9/0x3b0
Read of size 4 at addr ffff88805f9b600c by task kworker/1:2/1216
CPU: 1 PID: 1216 Comm: kworker/1:2 Not tainted 5.2.0-rc4+ #28
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Workqueue: events p9_write_work
Call Trace:
dump_stack+0x75/0xae
__kumsan_report+0x17c/0x3e6
kumsan_report+0xe/0x20
p9_conn_cancel+0x2d9/0x3b0
p9_write_work+0x183/0x4a0
process_one_work+0x4d1/0x8c0
worker_thread+0x6e/0x780
kthread+0x1ca/0x1f0
ret_from_fork+0x35/0x40
Allocated by task 1979:
save_stack+0x19/0x80
__kumsan_kmalloc.constprop.3+0xbc/0x120
kmem_cache_alloc+0xa7/0x170
p9_client_prepare_req.part.9+0x3b/0x380
p9_client_rpc+0x15e/0x880
p9_client_create+0x3d0/0xac0
v9fs_session_init+0x192/0xc80
v9fs_mount+0x67/0x470
legacy_get_tree+0x70/0xd0
vfs_get_tree+0x4a/0x1c0
do_mount+0xba9/0xf90
ksys_mount+0xa8/0x120
__x64_sys_mount+0x62/0x70
do_syscall_64+0x6d/0x1e0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Freed by task 0:
(stack is not available)
The buggy address belongs to the object at ffff88805f9b6008
which belongs to the cache p9_req_t of size 144
The buggy address is located 4 bytes inside of
144-byte region [ffff88805f9b6008, ffff88805f9b6098)
The buggy address belongs to the page:
page:ffffea00017e6d80 refcount:1 mapcount:0 mapping:ffff888068b63740 index:0xffff88805f9b7d90 compound_mapcount: 0
flags: 0x100000000010200(slab|head)
raw: 0100000000010200 ffff888068b66450 ffff888068b66450 ffff888068b63740
raw: ffff88805f9b7d90 0000000000100001 00000001ffffffff 0000000000000000
page dumped because: kumsan: bad access detected
==================================================================
Link: http://lkml.kernel.org/r/20190613070854.10434-1-shuaibinglu@126.com
Signed-off-by: Lu Shuaibing <shuaibinglu@126.com>
[dominique.martinet@cea.fr: grouped the added init with the others]
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
The socket assignment is wrong, see skb_orphan():
When skb->destructor callback is not set, but skb->sk is set, this hits BUG().
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1651813
Fixes: 554ced0a6e ("netfilter: nf_tables: add support for native socket matching")
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
A kernel panic can happen if a host has disabled IPv6 on boot and have to
process guest packets (coming from a bridge) using it's ip6tables.
IPv6 packets need to be dropped if the IPv6 module is not loaded, and the
host ip6tables will be used.
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When a function such as dsa_slave_create fails, currently the following
stack trace can be seen:
[ 2.038342] sja1105 spi0.1: Probed switch chip: SJA1105T
[ 2.054556] sja1105 spi0.1: Reset switch and programmed static config
[ 2.063837] sja1105 spi0.1: Enabled switch tagging
[ 2.068706] fsl-gianfar soc:ethernet@2d90000 eth2: error -19 setting up slave phy
[ 2.076371] ------------[ cut here ]------------
[ 2.080973] WARNING: CPU: 1 PID: 21 at net/core/devlink.c:6184 devlink_free+0x1b4/0x1c0
[ 2.088954] Modules linked in:
[ 2.092005] CPU: 1 PID: 21 Comm: kworker/1:1 Not tainted 5.3.0-rc6-01360-g41b52e38d2b6-dirty #1746
[ 2.100912] Hardware name: Freescale LS1021A
[ 2.105162] Workqueue: events deferred_probe_work_func
[ 2.110287] [<c03133a4>] (unwind_backtrace) from [<c030d8cc>] (show_stack+0x10/0x14)
[ 2.117992] [<c030d8cc>] (show_stack) from [<c10b08d8>] (dump_stack+0xb4/0xc8)
[ 2.125180] [<c10b08d8>] (dump_stack) from [<c0349d04>] (__warn+0xe0/0xf8)
[ 2.132018] [<c0349d04>] (__warn) from [<c0349e34>] (warn_slowpath_null+0x40/0x48)
[ 2.139549] [<c0349e34>] (warn_slowpath_null) from [<c0f19d74>] (devlink_free+0x1b4/0x1c0)
[ 2.147772] [<c0f19d74>] (devlink_free) from [<c1064fc0>] (dsa_switch_teardown+0x60/0x6c)
[ 2.155907] [<c1064fc0>] (dsa_switch_teardown) from [<c1065950>] (dsa_register_switch+0x8e4/0xaa8)
[ 2.164821] [<c1065950>] (dsa_register_switch) from [<c0ba7fe4>] (sja1105_probe+0x21c/0x2ec)
[ 2.173216] [<c0ba7fe4>] (sja1105_probe) from [<c0b35948>] (spi_drv_probe+0x80/0xa4)
[ 2.180920] [<c0b35948>] (spi_drv_probe) from [<c0a4c1cc>] (really_probe+0x108/0x400)
[ 2.188711] [<c0a4c1cc>] (really_probe) from [<c0a4c694>] (driver_probe_device+0x78/0x1bc)
[ 2.196933] [<c0a4c694>] (driver_probe_device) from [<c0a4a3dc>] (bus_for_each_drv+0x58/0xb8)
[ 2.205414] [<c0a4a3dc>] (bus_for_each_drv) from [<c0a4c024>] (__device_attach+0xd0/0x168)
[ 2.213637] [<c0a4c024>] (__device_attach) from [<c0a4b1d0>] (bus_probe_device+0x84/0x8c)
[ 2.221772] [<c0a4b1d0>] (bus_probe_device) from [<c0a4b72c>] (deferred_probe_work_func+0x84/0xc4)
[ 2.230686] [<c0a4b72c>] (deferred_probe_work_func) from [<c03650a4>] (process_one_work+0x218/0x510)
[ 2.239772] [<c03650a4>] (process_one_work) from [<c03660d8>] (worker_thread+0x2a8/0x5c0)
[ 2.247908] [<c03660d8>] (worker_thread) from [<c036b348>] (kthread+0x148/0x150)
[ 2.255265] [<c036b348>] (kthread) from [<c03010e8>] (ret_from_fork+0x14/0x2c)
[ 2.262444] Exception stack(0xea965fb0 to 0xea965ff8)
[ 2.267466] 5fa0: 00000000 00000000 00000000 00000000
[ 2.275598] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 2.283729] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[ 2.290333] ---[ end trace ca5d506728a0581a ]---
devlink_free is complaining right here:
WARN_ON(!list_empty(&devlink->port_list));
This happens because devlink_port_unregister is no longer done right
away in dsa_port_setup when a DSA_PORT_TYPE_USER has failed.
Vivien said about this change that:
Also no need to call devlink_port_unregister from within dsa_port_setup
as this step is inconditionally handled by dsa_port_teardown on error.
which is not really true. The devlink_port_unregister function _is_
being called unconditionally from within dsa_port_setup, but not for
this port that just failed, just for the previous ones which were set
up.
ports_teardown:
for (i = 0; i < port; i++)
dsa_port_teardown(&ds->ports[i]);
Initially I was tempted to fix this by extending the "for" loop to also
cover the port that failed during setup. But this could have potentially
unforeseen consequences unrelated to devlink_port or even other types of
ports than user ports, which I can't really test for. For example, if
for some reason devlink_port_register itself would fail, then
unconditionally unregistering it in dsa_port_teardown would not be a
smart idea. The list might go on.
So just make dsa_port_setup undo the setup it had done upon failure, and
let the for loop undo the work of setting up the previous ports, which
are guaranteed to be brought up to a consistent state.
Fixes: 955222ca52 ("net: dsa: use a single switch statement for port setup")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Fix some length checks during OGM processing in batman-adv, from
Sven Eckelmann.
2) Fix regression that caused netfilter conntrack sysctls to not be
per-netns any more. From Florian Westphal.
3) Use after free in netpoll, from Feng Sun.
4) Guard destruction of pfifo_fast per-cpu qdisc stats with
qdisc_is_percpu_stats(), from Davide Caratti. Similar bug is fixed
in pfifo_fast_enqueue().
5) Fix memory leak in mld_del_delrec(), from Eric Dumazet.
6) Handle neigh events on internal ports correctly in nfp, from John
Hurley.
7) Clear SKB timestamp in NF flow table code so that it does not
confuse fq scheduler. From Florian Westphal.
8) taprio destroy can crash if it is invoked in a failure path of
taprio_init(), because the list head isn't setup properly yet and
the list del is unconditional. Perform the list add earlier to
address this. From Vladimir Oltean.
9) Make sure to reapply vlan filters on device up, in aquantia driver.
From Dmitry Bogdanov.
10) sgiseeq driver releases DMA memory using free_page() instead of
dma_free_attrs(). From Christophe JAILLET.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (58 commits)
net: seeq: Fix the function used to release some memory in an error handling path
enetc: Add missing call to 'pci_free_irq_vectors()' in probe and remove functions
net: bcmgenet: use ethtool_op_get_ts_info()
tc-testing: don't hardcode 'ip' in nsPlugin.py
net: dsa: microchip: add KSZ8563 compatibility string
dt-bindings: net: dsa: document additional Microchip KSZ8563 switch
net: aquantia: fix out of memory condition on rx side
net: aquantia: linkstate irq should be oneshot
net: aquantia: reapply vlan filters on up
net: aquantia: fix limit of vlan filters
net: aquantia: fix removal of vlan 0
net/sched: cbs: Set default link speed to 10 Mbps in cbs_set_port_rate
taprio: Set default link speed to 10 Mbps in taprio_set_picos_per_byte
taprio: Fix kernel panic in taprio_destroy
net: dsa: microchip: fill regmap_config name
rxrpc: Fix lack of conn cleanup when local endpoint is cleaned up [ver #2]
net: stmmac: dwmac-rk: Don't fail if phy regulator is absent
amd-xgbe: Fix error path in xgbe_mod_init()
netfilter: nft_meta_bridge: Fix get NFT_META_BRI_IIFVPROTO in network byteorder
mac80211: Correctly set noencrypt for PAE frames
...
Pointer iter is being initialized with a value that is never read and
is being re-assigned a little later on. The assignment is redundant
and hence can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds handlers for PLDM over NC-SI command response.
This enables NC-SI driver recognizes the packet type so the responses
don't get dropped as unknown packet type.
PLDM over NC-SI are not handled in kernel driver for now, but can be
passed back to user space via Netlink for further handling.
Signed-off-by: Ben Wei <benwei@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make core more readable with switch-case for various port flavours.
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Devlink port index attribute is returned to users as u32 through
netlink response.
Change index data type from 'unsigned' to 'unsigned int' to avoid
below checkpatch.pl warning.
WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
81: FILE: include/net/devlink.h:81:
+ unsigned index;
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an application configures kernel TLS on top of a TCP socket, it's
now possible for inet_diag_handler() to collect information regarding the
protocol version, the cipher type and TX / RX configuration, in case
INET_DIAG_INFO is requested.
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
currently, only getsockopt(TCP_ULP) can be invoked to know if a ULP is on
top of a TCP socket. Extend idiag_get_aux() and idiag_get_aux_size(),
introduced by commit b37e88407c ("inet_diag: allow protocols to provide
additional data"), to report the ULP name and other information that can
be made available by the ULP through optional functions.
Users having CAP_NET_ADMIN privileges will then be able to retrieve this
information through inet_diag_handler, if they specify INET_DIAG_INFO in
the request.
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to make sure context does not get freed while diag
code is interrogating it. Free struct tls_context with
kfree_rcu().
We add the __rcu annotation directly in icsk, and cast it
away in the datapath accessor. Presumably all ULPs will
do a similar thing.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The discussion to be made is absolutely the same as in the case of
previous patch ("taprio: Set default link speed to 10 Mbps in
taprio_set_picos_per_byte"). Nothing is lost when setting a default.
Cc: Leandro Dorileo <leandro.maciel.dorileo@intel.com>
Fixes: e0a7683d30 ("net/sched: cbs: fix port_rate miscalculation")
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The taprio budget needs to be adapted at runtime according to interface
link speed. But that handling is problematic.
For one thing, installing a qdisc on an interface that doesn't have
carrier is not illegal. But taprio prints the following stack trace:
[ 31.851373] ------------[ cut here ]------------
[ 31.856024] WARNING: CPU: 1 PID: 207 at net/sched/sch_taprio.c:481 taprio_dequeue+0x1a8/0x2d4
[ 31.864566] taprio: dequeue() called with unknown picos per byte.
[ 31.864570] Modules linked in:
[ 31.873701] CPU: 1 PID: 207 Comm: tc Not tainted 5.3.0-rc5-01199-g8838fe023cd6 #1689
[ 31.881398] Hardware name: Freescale LS1021A
[ 31.885661] [<c03133a4>] (unwind_backtrace) from [<c030d8cc>] (show_stack+0x10/0x14)
[ 31.893368] [<c030d8cc>] (show_stack) from [<c10ac958>] (dump_stack+0xb4/0xc8)
[ 31.900555] [<c10ac958>] (dump_stack) from [<c0349d04>] (__warn+0xe0/0xf8)
[ 31.907395] [<c0349d04>] (__warn) from [<c0349d64>] (warn_slowpath_fmt+0x48/0x6c)
[ 31.914841] [<c0349d64>] (warn_slowpath_fmt) from [<c0f38db4>] (taprio_dequeue+0x1a8/0x2d4)
[ 31.923150] [<c0f38db4>] (taprio_dequeue) from [<c0f227b0>] (__qdisc_run+0x90/0x61c)
[ 31.930856] [<c0f227b0>] (__qdisc_run) from [<c0ec82ac>] (net_tx_action+0x12c/0x2bc)
[ 31.938560] [<c0ec82ac>] (net_tx_action) from [<c0302298>] (__do_softirq+0x130/0x3c8)
[ 31.946350] [<c0302298>] (__do_softirq) from [<c03502a0>] (irq_exit+0xbc/0xd8)
[ 31.953536] [<c03502a0>] (irq_exit) from [<c03a4808>] (__handle_domain_irq+0x60/0xb4)
[ 31.961328] [<c03a4808>] (__handle_domain_irq) from [<c0754478>] (gic_handle_irq+0x58/0x9c)
[ 31.969638] [<c0754478>] (gic_handle_irq) from [<c0301a8c>] (__irq_svc+0x6c/0x90)
[ 31.977076] Exception stack(0xe8167b20 to 0xe8167b68)
[ 31.982100] 7b20: e9d4bd80 00000cc0 000000cf 00000000 e9d4bd80 c1f38958 00000cc0 c1f38960
[ 31.990234] 7b40: 00000001 000000cf 00000004 e9dc0800 00000000 e8167b70 c0f478ec c0f46d94
[ 31.998363] 7b60: 60070013 ffffffff
[ 32.001833] [<c0301a8c>] (__irq_svc) from [<c0f46d94>] (netlink_trim+0x18/0xd8)
[ 32.009104] [<c0f46d94>] (netlink_trim) from [<c0f478ec>] (netlink_broadcast_filtered+0x34/0x414)
[ 32.017930] [<c0f478ec>] (netlink_broadcast_filtered) from [<c0f47cec>] (netlink_broadcast+0x20/0x28)
[ 32.027102] [<c0f47cec>] (netlink_broadcast) from [<c0eea378>] (rtnetlink_send+0x34/0x88)
[ 32.035238] [<c0eea378>] (rtnetlink_send) from [<c0f25890>] (notify_and_destroy+0x2c/0x44)
[ 32.043461] [<c0f25890>] (notify_and_destroy) from [<c0f25e08>] (qdisc_graft+0x398/0x470)
[ 32.051595] [<c0f25e08>] (qdisc_graft) from [<c0f27a00>] (tc_modify_qdisc+0x3a4/0x724)
[ 32.059470] [<c0f27a00>] (tc_modify_qdisc) from [<c0ee4c84>] (rtnetlink_rcv_msg+0x260/0x2ec)
[ 32.067864] [<c0ee4c84>] (rtnetlink_rcv_msg) from [<c0f4a988>] (netlink_rcv_skb+0xb8/0x110)
[ 32.076172] [<c0f4a988>] (netlink_rcv_skb) from [<c0f4a170>] (netlink_unicast+0x1b4/0x22c)
[ 32.084392] [<c0f4a170>] (netlink_unicast) from [<c0f4a5e4>] (netlink_sendmsg+0x33c/0x380)
[ 32.092614] [<c0f4a5e4>] (netlink_sendmsg) from [<c0ea9f40>] (sock_sendmsg+0x14/0x24)
[ 32.100403] [<c0ea9f40>] (sock_sendmsg) from [<c0eaa780>] (___sys_sendmsg+0x214/0x228)
[ 32.108279] [<c0eaa780>] (___sys_sendmsg) from [<c0eabad0>] (__sys_sendmsg+0x50/0x8c)
[ 32.116068] [<c0eabad0>] (__sys_sendmsg) from [<c0301000>] (ret_fast_syscall+0x0/0x54)
[ 32.123938] Exception stack(0xe8167fa8 to 0xe8167ff0)
[ 32.128960] 7fa0: b6fa68c8 000000f8 00000003 bea142d0 00000000 00000000
[ 32.137093] 7fc0: b6fa68c8 000000f8 0052154c 00000128 5d6468a2 00000000 00000028 00558c9c
[ 32.145224] 7fe0: 00000070 bea14278 00530d64 b6e17e64
[ 32.150659] ---[ end trace 2139c9827c3e5177 ]---
This happens because the qdisc ->dequeue callback gets called. Which
again is not illegal, the qdisc will dequeue even when the interface is
up but doesn't have carrier (and hence SPEED_UNKNOWN), and the frames
will be dropped further down the stack in dev_direct_xmit().
And, at the end of the day, for what? For calculating the initial budget
of an interface which is non-operational at the moment and where frames
will get dropped anyway.
So if we can't figure out the link speed, default to SPEED_10 and move
along. We can also remove the runtime check now.
Cc: Leandro Dorileo <leandro.maciel.dorileo@intel.com>
Fixes: 7b9eba7ba0 ("net/sched: taprio: fix picos_per_byte miscalculation")
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
taprio_init may fail earlier than this line:
list_add(&q->taprio_list, &taprio_list);
i.e. due to the net device not being multi queue.
Attempting to remove q from the global taprio_list when it is not part
of it will result in a kernel panic.
Fix it by matching list_add and list_del better to one another in the
order of operations. This way we can keep the deletion unconditional
and with lower complexity - O(1).
Cc: Leandro Dorileo <leandro.maciel.dorileo@intel.com>
Fixes: 7b9eba7ba0 ("net/sched: taprio: fix picos_per_byte miscalculation")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bridge core assumes that enabling/disabling vlan_filtering will
translate into the simple toggling of a flag for switchdev drivers.
That is clearly not the case for sja1105, which alters the VLAN table
and the pvids in order to obtain port separation in standalone mode.
There are 2 parts to the issue.
First, tag_8021q changes the pvid to a unique per-port rx_vid for frame
identification. But we need to disable tag_8021q when vlan_filtering
kicks in, and at that point, the VLAN configured as pvid will have to be
removed from the filtering table of the ports. With an invalid pvid, the
ports will drop all traffic. Since the bridge will not call any vlan
operation through switchdev after enabling vlan_filtering, we need to
ensure we're in a functional state ourselves. Hence read the pvid that
the bridge is aware of, and program that into our ports.
Secondly, tag_8021q uses the 1024-3071 range privately in
vlan_filtering=0 mode. Had the user installed one of these VLANs during
a previous vlan_filtering=1 session, then upon the next tag_8021q
cleanup for vlan_filtering to kick in again, VLANs in that range will
get deleted unconditionally, hence breaking user expectation. So when
deleting the VLANs, check if the bridge had knowledge about them, and if
it did, re-apply the settings. Wrap this logic inside a
dsa_8021q_vid_apply helper function to reduce code duplication.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently this simplified code snippet fails:
br_vlan_get_pvid(netdev, &pvid);
br_vlan_get_info(netdev, pvid, &vinfo);
ASSERT(!(vinfo.flags & BRIDGE_VLAN_INFO_PVID));
It is intuitive that the pvid of a netdevice should have the
BRIDGE_VLAN_INFO_PVID flag set.
However I can't seem to pinpoint a commit where this behavior was
introduced. It seems like it's been like that since forever.
At a first glance it would make more sense to just handle the
BRIDGE_VLAN_INFO_PVID flag in __vlan_add_flags. However, as Nikolay
explains:
There are a few reasons why we don't do it, most importantly because
we need to have only one visible pvid at any single time, even if it's
stale - it must be just one. Right now that rule will not be violated
by this change, but people will try using this flag and could see two
pvids simultaneously. You can see that the pvid code is even using
memory barriers to propagate the new value faster and everywhere the
pvid is read only once. That is the reason the flag is set
dynamically when dumping entries, too. A second (weaker) argument
against would be given the above we don't want another way to do the
same thing, specifically if it can provide us with two pvids (e.g. if
walking the vlan list) or if it can provide us with a pvid different
from the one set in the vg. [Obviously, I'm talking about RCU
pvid/vlan use cases similar to the dumps. The locked cases are fine.
I would like to avoid explaining why this shouldn't be relied upon
without locking]
So instead of introducing the above change and making sure of the pvid
uniqueness under RCU, simply dynamically populate the pvid flag in
br_vlan_get_info().
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Fix OGM and OGMv2 header read boundary check,
by Sven Eckelmann (2 patches)
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl1ozuUWHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoQt8D/0U5gjwcMbqt5W9UekW6Ci+Up6I
jJ3/hoZUD8lLWeKkYkSHdTUElpy0bLdLVGjUBJIZxM+UrCuKSTt0u04PAWN80JhZ
JdRblj0qicdwKll0Oyw08Ind5FKLLgDjN30z/9mDDRguMxJavowdBtmb5y/ybbiU
o4M/fnSkhUwiRwWK3cUKq1SVUrjAOg/C3fE7zVrn8XzRxH4TGvReCZTLKZZa9cGJ
m6Le68zT/JOrGe3O0uwQXbHFl+eqKYqNfrV4GBhL6saqLTrr3naiXjIP5YfyNONm
U0GRjmXWFQwfwwNxYeLqspwrg8VVuhy84H2FnrieY5kyJVQuX8637XZn9HDDxtkl
42TQ/jtogZ6GVamhN4c7HpvITeoPaVx8HLUJ3TPU89c7Va19D/XFBSKBL7nGADaM
FK0s8KZHXVkY18Lh4ak6dttjyZAnv7aNlpW7h0GyXJ0vTO6bU+QAQNOGZsO1romH
BHlX24+Y9G1PxtDHXE+fvEH4uolXuyOG6bgGVAEfWGLQdZoX5N9VjcRvP+BE7qUF
O4+I/sTVLXI65FYy/cRE+XwckXumZiq/PHqKHY7KB51Z0ZiMsi4yy0joVm64t/22
lm6922MP+r3vKudvMev08SKqFEcPze0JkwYE8XuirgnnjRsE+e1VJN2Z1j7yQKcI
K9gpMw7UREKglnwrhQ==
=hD2g
-----END PGP SIGNATURE-----
Merge tag 'batadv-net-for-davem-20190830' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here are two batman-adv bugfixes:
- Fix OGM and OGMv2 header read boundary check,
by Sven Eckelmann (2 patches)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Spurious warning when loading rules using the physdev match,
from Todd Seidelmann.
2) Fix FTP conntrack helper debugging output, from Thomas Jarosch.
3) Restore per-netns nf_conntrack_{acct,helper,timeout} sysctl knobs,
from Florian Westphal.
4) Clear skbuff timestamp from the flowtable datapath, also from Florian.
5) Fix incorrect byteorder of NFT_META_BRI_IIFVPROTO, from wenxu.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, addresses are chunk size aligned. This means, we are very
restricted in terms of where we can place chunk within the umem. For
example, if we have a chunk size of 2k, then our chunks can only be placed
at 0,2k,4k,6k,8k... and so on (ie. every 2k starting from 0).
This patch introduces the ability to use unaligned chunks. With these
changes, we are no longer bound to having to place chunks at a 2k (or
whatever your chunk size is) interval. Since we are no longer dealing with
aligned chunks, they can now cross page boundaries. Checks for page
contiguity have been added in order to keep track of which pages are
followed by a physically contiguous page.
Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
If a SYN cookie is not issued by tcp_v#_gen_syncookie, then the return
value will be exactly 0, rather than <= 0. Let's change the check to
reflect that, especially since mss is an unsigned value and cannot be
negative.
Fixes: 70d6624431 ("bpf: add bpf_tcp_gen_syncookie helper")
Reported-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Petar Penkov <ppenkov@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Recent rtnl lock removal patch changed flow_action infra to require proper
cleanup besides simple memory deallocation. However, matchall classifier
was not updated to call tc_cleanup_flow_action(). Add proper cleanup to
mall_replace_hw_filter() and mall_reoffload().
Fixes: 5a6ff4b13d ("net: sched: take reference to action dev before calling offloads")
Reported-by: Ido Schimmel <idosch@mellanox.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This explicitly clarifies that bbr_bdp() returns the rounded-up value of
the bandwidth-delay product and why in the comments.
Signed-off-by: Luke Hsiao <lukehsiao@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement this callback in order to get the offloaded stats added to the
kernel stats.
Reported-by: Pengfei Liu <pengfeil@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a local endpoint is ceases to be in use, such as when the kafs module
is unloaded, the kernel will emit an assertion failure if there are any
outstanding client connections:
rxrpc: Assertion failed
------------[ cut here ]------------
kernel BUG at net/rxrpc/local_object.c:433!
and even beyond that, will evince other oopses if there are service
connections still present.
Fix this by:
(1) Removing the triggering of connection reaping when an rxrpc socket is
released. These don't actually clean up the connections anyway - and
further, the local endpoint may still be in use through another
socket.
(2) Mark the local endpoint as dead when we start the process of tearing
it down.
(3) When destroying a local endpoint, strip all of its client connections
from the idle list and discard the ref on each that the list was
holding.
(4) When destroying a local endpoint, call the service connection reaper
directly (rather than through a workqueue) to immediately kill off all
outstanding service connections.
(5) Make the service connection reaper reap connections for which the
local endpoint is marked dead.
Only after destroying the connections can we close the socket lest we get
an oops in a workqueue that's looking at a connection or a peer.
Fixes: 3d18cbb7fd ("rxrpc: Fix conn expiry timers")
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl1nyGUACgkQ+7dXa6fL
C2v8zhAAlJvZ3DQJLnGiXFdBzGLEWP/TtVBHOjXjAVPB/nGUy9VZ8eCx6jgIDOUj
79jFqrO0zoNkdVVmhml8bTt4hl4MlaZbKM5/yz5wg7I3RfAss5cOJyNx4eULYhQQ
V+QPn4uUj7jR/2KBYf+AasFUx4NZVQIeyo3H5mOmi3gatDIR3sqskx48mdSJSR2f
nuila/WE+g/iEygw9TwaqdrfR+4E94Sw4FoHIVy2rIlLWeuOfVInFAn7Tw9CsnZN
nTy+KBiYgJsO5f5bqaoKC7Ku4cmHD+Gy+AciETlvjk5Gjent5V7dHvnSL14pC7jD
WoOXMq+V93uzCHRz2iSHrj0FZJH5k7Q8OlioNr7u4FHdOBqZc1eJvGR6KHcGTOcU
RZGlSwal1+FQ66LY1OIf0EjBYcYOkSB3hZJhTwwMOm1ZWiPdTq/J3FIN+f6POWLL
djd4NVhlYTz8zaDoMIA+iWlGrR3IMy3+uH91CNMJnTRIb4l0wg2As61ffEixD69L
wW7C3VD1ZbFrlEv/33/a9dn0HzEfBrbMCkuKz/IqPj4W4yZNczb+6WOTkIN/nGFE
9u1Pok2W32QeOBxDysQqJa/zT/5suSbhleMRAiGVj78yhKaK340IEmd13a0ihw7m
blw09an6VG9DGFjvy5fFmpmELFw/zXwPhUSZLTWeZy7HQBIdEyU=
=1xih
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-fixes-20190827' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Fix use of skb_cow_data()
Here's a series of patches that replaces the use of skb_cow_data() in rxrpc
with skb_unshare() early on in the input process. The problem that is
being seen is that skb_cow_data() indirectly requires that the maximum
usage count on an sk_buff be 1, and it may generate an assertion failure in
pskb_expand_head() if not.
This can occur because rxrpc_input_data() may be still holding a ref when
it has just attached the sk_buff to the rx ring and given that attachment
its own ref. If recvmsg happens fast enough, skb_cow_data() can see the
ref still held by the softirq handler.
Further, a packet may contain multiple subpackets, each of which gets its
own attachment to the ring and its own ref - also making skb_cow_data() go
bang.
Fix this by:
(1) The DATA packet is currently parsed for subpackets twice by the input
routines. Parse it just once instead and make notes in the sk_buff
private data.
(2) Use the notes from (1) when attaching the packet to the ring multiple
times. Once the packet is attached to the ring, recvmsg can see it
and start modifying it, so the softirq handler is not permitted to
look inside it from that point.
(3) Pass the ref from the input code to the ring rather than getting an
extra ref. rxrpc_input_data() uses a ref on the second refcount to
prevent the packet from evaporating under it.
(4) Call skb_unshare() on secured DATA packets in rxrpc_input_packet()
before we take call->input_lock. Other sorts of packets don't get
modified and so can be left.
A trace is emitted if skb_unshare() eats the skb. Note that
skb_share() for our accounting in this regard as we can't see the
parameters in the packet to log in a trace line if it releases it.
(5) Remove the calls to skb_cow_data(). These are then no longer
necessary.
There are also patches to improve the rxrpc_skb tracepoint to make sure
that Tx-derived buffers are identified separately from Rx-derived buffers
in the trace.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl1pLO8THGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi8ceB/4qXnsvkIfB7XMxg8KUXOtJiJPJX+j2
H8yFESAL7Qof3mJVmsnVREW0LBECxJbipSkFmsJtFgLMlGLAkGMvJQjrmp4EWlc7
OIMWirTkZHLUgWV485EXDerlL9XSyiTxQ3ccKLwQSSxeN+EBU8CnzYzw5rUy2pJl
n1RheF+nRn4sLeOggnXPbEEYHqyDcgOzVVBbP7dq0om8H8KV/1jz4w12Ybpm4BSb
s/3Sp9kSeT0VySlmoeSYly5/rNnYHgHzV5/qjEbzugVnqPrLTC+j9t3vJTpMoH7T
zum6Y41mxWLdXiiYAdccOmGEKXDXk2yxFEIVKwYBnsvLS1A11cXpuPdL
=3uNw
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.3-rc7' of git://github.com/ceph/ceph-client
Pull two ceph fixes from Ilya Dryomov:
"A fix for a -rc1 regression in rbd and a trivial static checker fix"
* tag 'ceph-for-5.3-rc7' of git://github.com/ceph/ceph-client:
rbd: restore zeroing past the overlap when reading from parent
libceph: don't call crypto_free_sync_skcipher() on a NULL tfm
This is useful for checking how much airtime is being used up by other
transmissions on the channel, e.g. by calculating (time_rx - time_bss_rx)
or (time_busy - time_bss_rx - time_tx)
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20190828102042.58016-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Get the vlan_proto of ingress bridge in network byteorder as userspace
expects. Otherwise this is inconsistent with NFT_META_PROTOCOL.
Fixes: 2a3a93ef0b ("netfilter: nft_meta_bridge: Add NFT_META_BRI_IIFVPROTO support")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The noencrypt flag was intended to be set if the "frame was received
unencrypted" according to include/uapi/linux/nl80211.h. However, the
current behavior is opposite of this.
Cc: stable@vger.kernel.org
Fixes: 018f6fbf54 ("mac80211: Send control port frames over nl80211")
Signed-off-by: Denis Kenzior <denkenz@gmail.com>
Link: https://lore.kernel.org/r/20190827224120.14545-3-denkenz@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In ieee80211_deliver_skb_to_local_stack intercepts EAPoL frames if
mac80211 is configured to do so and forwards the contents over nl80211.
During this process some additional data is also forwarded, including
whether the frame was received encrypted or not. Unfortunately just
prior to the call to ieee80211_deliver_skb_to_local_stack, skb->cb is
cleared, resulting in incorrect data being exposed over nl80211.
Fixes: 018f6fbf54 ("mac80211: Send control port frames over nl80211")
Cc: stable@vger.kernel.org
Signed-off-by: Denis Kenzior <denkenz@gmail.com>
Link: https://lore.kernel.org/r/20190827224120.14545-2-denkenz@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If 'fq' qdisc is used and a program has requested timestamps,
skb->tstamp needs to be cleared, else fq will treat these as
'transmit time'.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
TCP associates tx timestamp requests with a byte in the bytestream.
If merging skbs in tcp_mtu_probe, migrate the tstamp request.
Similar to MSG_EOR, do not allow moving a timestamp from any segment
in the probe but the last. This to avoid merging multiple timestamps.
Tested with the packetdrill script at
https://github.com/wdebruij/packetdrill/commits/mtu_probe-1
Link: http://patchwork.ozlabs.org/patch/1143278/#2232897
Fixes: 4ed2d765df ("net-timestamp: TCP timestamping")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Action sample doesn't properly handle psample_group pointer in overwrite
case. Following issues need to be fixed:
- In tcf_sample_init() function RCU_INIT_POINTER() is used to set
s->psample_group, even though we neither setting the pointer to NULL, nor
preventing concurrent readers from accessing the pointer in some way.
Use rcu_swap_protected() instead to safely reset the pointer.
- Old value of s->psample_group is not released or deallocated in any way,
which results resource leak. Use psample_group_put() on non-NULL value
obtained with rcu_swap_protected().
- The function psample_group_put() that released reference to struct
psample_group pointed by rcu-pointer s->psample_group doesn't respect rcu
grace period when deallocating it. Extend struct psample_group with rcu
head and use kfree_rcu when freeing it.
Fixes: 5c5670fae4 ("net/sched: Introduce sample tc action")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only the first fragment in a datagram contains the L4 headers. When the
Open vSwitch module parses a packet, it always sets the IP protocol
field in the key, but can only set the L4 fields on the first fragment.
The original behavior would not clear the L4 portion of the key, so
garbage values would be sent in the key for "later" fragments. This
patch clears the L4 fields in that circumstance to prevent sending those
garbage values as part of the upcall.
Signed-off-by: Justin Pettit <jpettit@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When IP fragments are reassembled before being sent to conntrack, the
key from the last fragment is used. Unless there are reordering
issues, the last fragment received will not contain the L4 ports, so the
key for the reassembled datagram won't contain them. This patch updates
the key once we have a reassembled datagram.
The handle_fragments() function works on L3 headers so we pull the L3/L4
flow key update code from key_extract into a new function
'key_extract_l3l4'. Then we add a another new function
ovs_flow_key_update_l3l4() and export it so that it is accessible by
handle_fragments() for conntrack packet reassembly.
Co-authored-by: Justin Pettit <jpettit@ovn.org>
Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that 'TCQ_F_CPUSTATS' bit can be cleared, depending on the value of
'TCQ_F_NOLOCK' bit in the parent qdisc, we need to be sure that per-cpu
counters are present when 'reset()' is called for pfifo_fast qdiscs.
Otherwise, the following script:
# tc q a dev lo handle 1: root htb default 100
# tc c a dev lo parent 1: classid 1:100 htb \
> rate 95Mbit ceil 100Mbit burst 64k
[...]
# tc f a dev lo parent 1: protocol arp basic classid 1:100
[...]
# tc q a dev lo parent 1:100 handle 100: pfifo_fast
[...]
# tc q d dev lo root
can generate the following splat:
Unable to handle kernel paging request at virtual address dfff2c01bd148000
Mem abort info:
ESR = 0x96000004
Exception class = DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
Data abort info:
ISV = 0, ISS = 0x00000004
CM = 0, WnR = 0
[dfff2c01bd148000] address between user and kernel address ranges
Internal error: Oops: 96000004 [#1] SMP
[...]
pstate: 80000005 (Nzcv daif -PAN -UAO)
pc : pfifo_fast_reset+0x280/0x4d8
lr : pfifo_fast_reset+0x21c/0x4d8
sp : ffff800d09676fa0
x29: ffff800d09676fa0 x28: ffff200012ee22e4
x27: dfff200000000000 x26: 0000000000000000
x25: ffff800ca0799958 x24: ffff1001940f332b
x23: 0000000000000007 x22: ffff200012ee1ab8
x21: 0000600de8a40000 x20: 0000000000000000
x19: ffff800ca0799900 x18: 0000000000000000
x17: 0000000000000002 x16: 0000000000000000
x15: 0000000000000000 x14: 0000000000000000
x13: 0000000000000000 x12: ffff1001b922e6e2
x11: 1ffff001b922e6e1 x10: 0000000000000000
x9 : 1ffff001b922e6e1 x8 : dfff200000000000
x7 : 0000000000000000 x6 : 0000000000000000
x5 : 1fffe400025dc45c x4 : 1fffe400025dc357
x3 : 00000c01bd148000 x2 : 0000600de8a40000
x1 : 0000000000000007 x0 : 0000600de8a40004
Call trace:
pfifo_fast_reset+0x280/0x4d8
qdisc_reset+0x6c/0x370
htb_reset+0x150/0x3b8 [sch_htb]
qdisc_reset+0x6c/0x370
dev_deactivate_queue.constprop.5+0xe0/0x1a8
dev_deactivate_many+0xd8/0x908
dev_deactivate+0xe4/0x190
qdisc_graft+0x88c/0xbd0
tc_get_qdisc+0x418/0x8a8
rtnetlink_rcv_msg+0x3a8/0xa78
netlink_rcv_skb+0x18c/0x328
rtnetlink_rcv+0x28/0x38
netlink_unicast+0x3c4/0x538
netlink_sendmsg+0x538/0x9a0
sock_sendmsg+0xac/0xf8
___sys_sendmsg+0x53c/0x658
__sys_sendmsg+0xc8/0x140
__arm64_sys_sendmsg+0x74/0xa8
el0_svc_handler+0x164/0x468
el0_svc+0x10/0x14
Code: 910012a0 92400801 d343fc03 11000c21 (38fb6863)
Fix this by testing the value of 'TCQ_F_CPUSTATS' bit in 'qdisc->flags',
before dereferencing 'qdisc->cpu_qstats'.
Changes since v1:
- coding style improvements, thanks to Stefano Brivio
Fixes: 8a53e616de ("net: sched: when clearing NOLOCK, clear TCQ_F_CPUSTATS, too")
CC: Paolo Abeni <pabeni@redhat.com>
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In set_secret(), key->tfm is assigned to NULL on line 55, and then
ceph_crypto_key_destroy(key) is executed.
ceph_crypto_key_destroy(key)
crypto_free_sync_skcipher(key->tfm)
crypto_free_skcipher(&tfm->base);
This happens to work because crypto_sync_skcipher is a trivial wrapper
around crypto_skcipher: &tfm->base is still 0 and crypto_free_skcipher()
handles that. Let's not rely on the layout of crypto_sync_skcipher.
This bug is found by a static analysis tool STCheck written by us.
Fixes: 69d6302b65 ("libceph: Remove VLA usage of skcipher").
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Vladimir Rutsky reported stuck TCP sessions after memory pressure
events. Edge Trigger epoll() user would never receive an EPOLLOUT
notification allowing them to retry a sendmsg().
Jason tested the case of sk_stream_alloc_skb() returning NULL,
but there are other paths that could lead both sendmsg() and sendpage()
to return -1 (EAGAIN), with an empty skb queued on the write queue.
This patch makes sure we remove this empty skb so that
Jason code can detect that the queue is empty, and
call sk->sk_write_space(sk) accordingly.
Fixes: ce5ec44099 ("tcp: ensure epoll edge trigger wakeup when write queue is empty")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Baron <jbaron@akamai.com>
Reported-by: Vladimir Rutsky <rutsky@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The rds6_inc_info_copy() function has a couple struct members which
are leaking stack information. The ->tos field should hold actual
information and the ->flags field needs to be zeroed out.
Fixes: 3eb450367d ("rds: add type of service(tos) infrastructure")
Fixes: b7ff8b1036 ("rds: Extend RDS API for IPv6 support")
Reported-by: 黄ID蝴蝶 <butterflyhuangxx@gmail.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SCTP_ECN_SUPPORTED sockopt will be added to allow users to change
ep ecn flag, and it's similar with other feature flags.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sysctl net.sctp.ecn_enable is added in this patch. It will allow
users to change the default sctp ecn flag, net.sctp.ecn_enable.
This feature was also required on this thread:
http://lkml.iu.edu/hypermail/linux/kernel/0812.1/01858.html
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to add ecn flag for both netns_sctp and sctp_endpoint,
net->sctp.ecn_enable is set 1 by default, and ep->ecn_enable will
be initialized with net->sctp.ecn_enable.
asoc->peer.ecn_capable will be set during negotiation only when
ep->ecn_enable is set on both sides.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When adding a VLAN sub-interface on a DSA slave port, the 8021q core
checks NETIF_F_HW_VLAN_CTAG_FILTER and, if the netdev is capable of
filtering, calls .ndo_vlan_rx_add_vid or .ndo_vlan_rx_kill_vid to
configure the VLAN offloading.
DSA sets this up counter-intuitively: it always advertises this netdev
feature, but the underlying driver may not actually support VLAN table
manipulation. In that case, the DSA core is forced to ignore the error,
because not being able to offload the VLAN is still fine - and should
result in the creation of a non-accelerated VLAN sub-interface.
Change this so that the netdev feature is only advertised for switch
drivers that support VLAN manipulation, instead of checking for
-EOPNOTSUPP at runtime.
Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After witnessing the discussion in https://lkml.org/lkml/2019/8/14/151
w.r.t. ioctl extensibility, it became clear that such an issue might
prevent that the 3 RSV bits inside the DSA 802.1Q tag might also suffer
the same fate and be useless for further extension.
So clearly specify that the reserved bits should currently be
transmitted as zero and ignored on receive. The DSA tagger already does
this (and has always did), and is the only known user so far (no
Wireshark dissection plugin, etc). So there should be no incompatibility
to speak of.
Fixes: 0471dd429c ("net: dsa: tag_8021q: Create a stable binary format")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the bridge offloads a VLAN on a slave port, we also need to
program its dedicated CPU port as a member of the VLAN.
Drivers may handle the CPU port's membership as they want. For example,
Marvell as a special "Unmodified" mode to pass frames as is through
such ports.
Even though DSA expects the drivers to handle the CPU port membership,
it does not make sense to program user VLANs as PVID on the CPU port.
This patch clears this flag before programming the CPU port.
Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Suggested-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
DSA currently programs a VLAN on the CPU port implicitly after the
related notifier is received by a switch.
While we still need to do this transparent programmation of the DSA
links in the fabric, programming the CPU port this way may cause
problems in some corners such as the tag_8021q driver.
Because the dedicated CPU port is specific to a slave, make their
programmation explicit a few layers up, in the slave code.
Note that technically, DSA links have a dedicated CPU port as well,
but since they are only used as conduit between interconnected switches
of a fabric, programming them transparently this way is what we want.
Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bridge VLANs are not offloaded by dsa_port_vlan_* if the port is
not bridged or if its bridge is not VLAN aware.
This is a good thing but other corners of DSA, such as the tag_8021q
driver, may need to program VLANs regardless the bridge state.
And also because bridge_dev is specific to user ports anyway, move
these checks were it belongs, one layer up in the slave code.
Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Suggested-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add dsa_slave_vlan_add and dsa_slave_vlan_del helpers to handle
SWITCHDEV_OBJ_ID_PORT_VLAN switchdev objects. Also copy the
switchdev_obj_port_vlan structure on add since we will modify it in
future patches.
Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently dsa_port_vid_add returns 0 if the switch returns -EOPNOTSUPP.
This function is used in the tag_8021q.c code to offload the PVID of
ports, which would simply not work if .port_vlan_add is not supported
by the underlying switch.
Do not skip -EOPNOTSUPP in dsa_port_vid_add but only when necessary,
that is to say in dsa_slave_vlan_rx_add_vid.
Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The bitmap operations were introduced to simplify the switch drivers
in the future, since most of them could implement the common VLAN and
MDB operations (add, del, dump) with simple functions taking all target
ports at once, and thus limiting the number of hardware accesses.
Programming an MDB or VLAN this way in a single operation would clearly
simplify the drivers a lot but would require a new get-set interface
in DSA. The usage of such bitmap from the stack also raised concerned
in the past, leading to the dynamic allocation of a new ds->_bitmap
member in the dsa_switch structure. So let's get rid of them for now.
This commit nicely wraps the ds->ops->port_{mdb,vlan}_{prepare,add}
switch operations into new dsa_switch_{mdb,vlan}_{prepare,add}
variants not using any bitmap argument anymore.
New dsa_switch_{mdb,vlan}_match helpers have been introduced to make
clear which local port of a switch must be programmed with the target
object. While the targeted user port is an obvious candidate, the
DSA links must also be programmed, as well as the CPU port for VLANs.
While at it, also remove local variables that are only used once.
Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The net pointer in struct xt_tgdtor_param is not explicitly
initialized therefore is still NULL when dereferencing it.
So we have to find a way to pass the correct net pointer to
ipt_destroy_target().
The best way I find is just saving the net pointer inside the per
netns struct tcf_idrinfo, which could make this patch smaller.
Fixes: 0c66dc1ea3 ("netfilter: conntrack: register hooks in netns when needed by ruleset")
Reported-and-tested-by: itugrok@yahoo.com
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Highlights include:
Stable fixes:
- Fix a page lock leak in nfs_pageio_resend()
- Ensure O_DIRECT reports an error if the bytes read/written is 0
- Don't handle errors if the bind/connect succeeded
- Revert "NFSv4/flexfiles: Abort I/O early if the layout segment was invalidat
ed"
Bugfixes:
- Don't refresh attributes with mounted-on-file information
- Fix return values for nfs4_file_open() and nfs_finish_open()
- Fix pnfs layoutstats reporting of I/O errors
- Don't use soft RPC calls for pNFS/flexfiles I/O, and don't abort for
soft I/O errors when the user specifies a hard mount.
- Various fixes to the error handling in sunrpc
- Don't report writepage()/writepages() errors twice.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl1lgz4ACgkQZwvnipYK
APIsHhAApqaVaGzwfeR87zq+QaaVOzYzejyvFgs3wh/Lc5xPH+SlQ6NxLbs8ppdT
srrOHV9E2MA4JgqoHaIBMTqWacQ0UfQQ/6qLEFCrps9/0QHs7fg0CAHS5emmgk2v
rD6Mezr5Nx8h5/QJCBEZXfas5lxsICz1EYJ4Pk8QT6IoyeC+fvarGZKvzIQJ3KDN
8yrdv5kCVtN7noREf1KDIqIlYvFbIEoOoglNA40G49e1ffT9Oz6qzTcg19HFO50x
eAIxc9u4KCUY/ASCvcv9biQ5200l7QSCqmR7/Xlj/+4aClKp6Ay058j0awxtHHDy
NlZt6V3XGlm1/SVpvtU/XXWcyJmQwX7kOVIEYOFmt+lEqC7ZBzWEpAaJ8h4DMLLc
PIxIWBSmXNxp6LPNI0dZFf7O6UZ3ZMRacav+HHu7mjWolEB22f4jQJs+RxNhnfLU
fg180YWBMX4V/98S7iigxZkRd+qqQhddYtku+o+bp3h4m6mVrrYNm11J0o0GWQWf
Lio9nlkLq9hkYpdBwkH4PtIv3b+O5f9yhfEYn15eF27Ru0Bob0+DiBkzlflcrJve
W2VfNAj+jxP3Wg0QAI40BSqUB3b+zVtZW5FenAUEK7NxhhPi6jrIsVhhVgGFZIAd
i1xwYUg6fDjielhGOxMTF66ilvduA9uBCFAnTD3iSBoZmF63vew=
=YHhU
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.3-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable fixes:
- Fix a page lock leak in nfs_pageio_resend()
- Ensure O_DIRECT reports an error if the bytes read/written is 0
- Don't handle errors if the bind/connect succeeded
- Revert "NFSv4/flexfiles: Abort I/O early if the layout segment was
invalidat ed"
Bugfixes:
- Don't refresh attributes with mounted-on-file information
- Fix return values for nfs4_file_open() and nfs_finish_open()
- Fix pnfs layoutstats reporting of I/O errors
- Don't use soft RPC calls for pNFS/flexfiles I/O, and don't abort
for soft I/O errors when the user specifies a hard mount.
- Various fixes to the error handling in sunrpc
- Don't report writepage()/writepages() errors twice"
* tag 'nfs-for-5.3-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: remove set but not used variable 'mapping'
NFSv2: Fix write regression
NFSv2: Fix eof handling
NFS: Fix writepage(s) error handling to not report errors twice
NFS: Fix spurious EIO read errors
pNFS/flexfiles: Don't time out requests on hard mounts
SUNRPC: Handle connection breakages correctly in call_status()
Revert "NFSv4/flexfiles: Abort I/O early if the layout segment was invalidated"
SUNRPC: Handle EADDRINUSE and ENOBUFS correctly
pNFS/flexfiles: Turn off soft RPC calls
SUNRPC: Don't handle errors if the bind/connect succeeded
NFS: On fatal writeback errors, we need to call nfs_inode_remove_request()
NFS: Fix initialisation of I/O result struct in nfs_pgio_rpcsetup
NFS: Ensure O_DIRECT reports an error if the bytes read/written is 0
NFSv4/pnfs: Fix a page lock leak in nfs_pageio_resend()
NFSv4: Fix return value in nfs_finish_open()
NFSv4: Fix return values for nfs4_file_open()
NFS: Don't refresh attributes with mounted-on-file information
Pull networking fixes from David Miller:
1) Use 32-bit index for tails calls in s390 bpf JIT, from Ilya
Leoshkevich.
2) Fix missed EPOLLOUT events in TCP, from Eric Dumazet. Same fix for
SMC from Jason Baron.
3) ipv6_mc_may_pull() should return 0 for malformed packets, not
-EINVAL. From Stefano Brivio.
4) Don't forget to unpin umem xdp pages in error path of
xdp_umem_reg(). From Ivan Khoronzhuk.
5) Fix sta object leak in mac80211, from Johannes Berg.
6) Fix regression by not configuring PHYLINK on CPU port of bcm_sf2
switches. From Florian Fainelli.
7) Revert DMA sync removal from r8169 which was causing regressions on
some MIPS Loongson platforms. From Heiner Kallweit.
8) Use after free in flow dissector, from Jakub Sitnicki.
9) Fix NULL derefs of net devices during ICMP processing across
collect_md tunnels, from Hangbin Liu.
10) proto_register() memory leaks, from Zhang Lin.
11) Set NLM_F_MULTI flag in multipart netlink messages consistently,
from John Fastabend.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (66 commits)
r8152: Set memory to all 0xFFs on failed reg reads
openvswitch: Fix conntrack cache with timeout
ipv4: mpls: fix mpls_xmit for iptunnel
nexthop: Fix nexthop_num_path for blackhole nexthops
net: rds: add service level support in rds-info
net: route dump netlink NLM_F_MULTI flag missing
s390/qeth: reject oversized SNMP requests
sock: fix potential memory leak in proto_register()
MAINTAINERS: Add phylink keyword to SFF/SFP/SFP+ MODULE SUPPORT
xfrm/xfrm_policy: fix dst dev null pointer dereference in collect_md mode
ipv4/icmp: fix rt dst dev null pointer dereference
openvswitch: Fix log message in ovs conntrack
bpf: allow narrow loads of some sk_reuseport_md fields with offset > 0
bpf: fix use after free in prog symbol exposure
bpf: fix precision tracking in presence of bpf2bpf calls
flow_dissector: Fix potential use-after-free on BPF_PROG_DETACH
Revert "r8169: remove not needed call to dma_sync_single_for_device"
ipv6: propagate ipv6_add_dev's error returns out of ipv6_find_idev
net/ncsi: Fix the payload copying for the request coming from Netlink
qed: Add cleanup in qed_slowpath_start()
...
when spinlock is locked/unlocked, its elements will be changed,
so marking it as __read_mostly is not suitable.
and remove a duplicate definition of nf_conntrack_locks_all_lock
strange that compiler does not complain.
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When I merged the extension sysctl tables with the main one I forgot to
reset them on netns creation. They currently read/write init_net settings.
Fixes: d912dec124 ("netfilter: conntrack: merge acct and helper sysctl table with main one")
Fixes: cb2833ed00 ("netfilter: conntrack: merge ecache and timestamp sysctl tables with main one")
Reported-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch implements the delete operation from the ruleset.
It implements a new delete() function in nft_set_rhash. It is simpler
to use than the already existing remove(), because it only takes the set
and the key as arguments, whereas remove() expects a full
nft_set_elem structure.
Signed-off-by: Ander Juaristi <a@juaristi.eus>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The find_pattern() debug output was printing the 'skip' character.
This can be a NULL-byte and messes up further pr_debug() output.
Output without the fix:
kernel: nf_conntrack_ftp: Pattern matches!
kernel: nf_conntrack_ftp: Skipped up to `<7>nf_conntrack_ftp: find_pattern `PORT': dlen = 8
kernel: nf_conntrack_ftp: find_pattern `EPRT': dlen = 8
Output with the fix:
kernel: nf_conntrack_ftp: Pattern matches!
kernel: nf_conntrack_ftp: Skipped up to 0x0 delimiter!
kernel: nf_conntrack_ftp: Match succeeded!
kernel: nf_conntrack_ftp: conntrack_ftp: match `172,17,0,100,200,207' (20 bytes at 4150681645)
kernel: nf_conntrack_ftp: find_pattern `PORT': dlen = 8
Signed-off-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Simplify the check in physdev_mt_check() to emit an error message
only when passed an invalid chain (ie, NF_INET_LOCAL_OUT).
This avoids cluttering up the log with errors against valid rules.
For large/heavily modified rulesets, current behavior can quickly
overwhelm the ring buffer, because this function gets called on
every change, regardless of the rule that was changed.
Signed-off-by: Todd Seidelmann <tseidelmann@linode.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The in-place decryption routines in AF_RXRPC's rxkad security module
currently call skb_cow_data() to make sure the data isn't shared and that
the skb can be written over. This has a problem, however, as the softirq
handler may be still holding a ref or the Rx ring may be holding multiple
refs when skb_cow_data() is called in rxkad_verify_packet() - and so
skb_shared() returns true and __pskb_pull_tail() dislikes that. If this
occurs, something like the following report will be generated.
kernel BUG at net/core/skbuff.c:1463!
...
RIP: 0010:pskb_expand_head+0x253/0x2b0
...
Call Trace:
__pskb_pull_tail+0x49/0x460
skb_cow_data+0x6f/0x300
rxkad_verify_packet+0x18b/0xb10 [rxrpc]
rxrpc_recvmsg_data.isra.11+0x4a8/0xa10 [rxrpc]
rxrpc_kernel_recv_data+0x126/0x240 [rxrpc]
afs_extract_data+0x51/0x2d0 [kafs]
afs_deliver_fs_fetch_data+0x188/0x400 [kafs]
afs_deliver_to_call+0xac/0x430 [kafs]
afs_wait_for_call_to_complete+0x22f/0x3d0 [kafs]
afs_make_call+0x282/0x3f0 [kafs]
afs_fs_fetch_data+0x164/0x300 [kafs]
afs_fetch_data+0x54/0x130 [kafs]
afs_readpages+0x20d/0x340 [kafs]
read_pages+0x66/0x180
__do_page_cache_readahead+0x188/0x1a0
ondemand_readahead+0x17d/0x2e0
generic_file_read_iter+0x740/0xc10
__vfs_read+0x145/0x1a0
vfs_read+0x8c/0x140
ksys_read+0x4a/0xb0
do_syscall_64+0x43/0xf0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fix this by using skb_unshare() instead in the input path for DATA packets
that have a security index != 0. Non-DATA packets don't need in-place
encryption and neither do unencrypted DATA packets.
Fixes: 248f219cb8 ("rxrpc: Rewrite the data and ack handling code")
Reported-by: Julian Wollrath <jwollrath@web.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Use the previously-added transmit-phase skbuff private flag to simplify the
socket buffer tracing a bit. Which phase the skbuff comes from can now be
divined from the skb rather than having to be guessed from the call state.
We can also reduce the number of rxrpc_skb_trace values by eliminating the
difference between Tx and Rx in the symbols.
Signed-off-by: David Howells <dhowells@redhat.com>
Add a flag in the private data on an skbuff to indicate that this is a
transmission-phase buffer rather than a receive-phase buffer.
Signed-off-by: David Howells <dhowells@redhat.com>
Abstract out rxtx ring cleanup into its own function from its two callers.
This makes it easier to apply the same changes to both.
Signed-off-by: David Howells <dhowells@redhat.com>
Pass the reference held on a DATA skb in the rxrpc input handler into the
Rx ring rather than getting an additional ref for this and then dropping
the original ref at the end.
Signed-off-by: David Howells <dhowells@redhat.com>
Use the information now cached in the skbuff private data to avoid the need
to reparse a jumbo packet. We can find all the subpackets by dead
reckoning, so it's only necessary to note how many there are, whether the
last one is flagged as LAST_PACKET and whether any have the REQUEST_ACK
flag set.
This is necessary as once recvmsg() can see the packet, it can start
modifying it, such as doing in-place decryption.
Fixes: 248f219cb8 ("rxrpc: Rewrite the data and ack handling code")
Signed-off-by: David Howells <dhowells@redhat.com>
Improve the information stored about jumbo packets so that we don't need to
reparse them so much later.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
Don't manually take rtnl lock in flower classifier before calling cls
hardware offloads API. Instead, pass rtnl lock status via 'rtnl_held'
parameter.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to remove dependency on rtnl lock, modify tc_setup_flow_action()
to copy tunnel info, instead of just saving pointer to tunnel_key action
tunnel info. This is necessary to prevent concurrent action overwrite from
releasing tunnel info while it is being used by rtnl-unlocked driver.
Implement helper tcf_tunnel_info_copy() that is used to copy tunnel info
with all its options to dynamically allocated memory block. Modify
tc_cleanup_flow_action() to free dynamically allocated tunnel info.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to remove dependency on rtnl lock when calling hardware offload
API, take reference to action mirred dev when initializing flow_action
structure in tc_setup_flow_action(). Implement function
tc_cleanup_flow_action(), use it to release the device after hardware
offload API is done using it.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to allow using new flow_action infrastructure from unlocked
classifiers, modify tc_setup_flow_action() to accept new 'rtnl_held'
argument. Take rtnl lock before accessing tc_action data. This is necessary
to protect from concurrent action replace.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to remove dependency on rtnl lock from offloads code of
classifiers, take rtnl lock conditionally before executing driver
callbacks. Only obtain rtnl lock if block is bound to devices that require
it.
Block bind/unbind code is rtnl-locked and obtains block->cb_lock while
holding rtnl lock. Obtain locks in same order in tc_setup_cb_*() functions
to prevent deadlock.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend struct flow_block_offload with "unlocked_driver_cb" flag to allow
registering and unregistering block hardware offload callbacks that do not
require caller to hold rtnl lock. Extend tcf_block with additional
lockeddevcnt counter that is incremented for each non-unlocked driver
callback attached to device. This counter is necessary to conditionally
obtain rtnl lock before calling hardware callbacks in following patches.
Register mlx5 tc block offload callbacks as "unlocked".
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To remove dependency on rtnl lock, extend classifier ops with new
ops->hw_add() and ops->hw_del() callbacks. Call them from cls API while
holding cb_lock every time filter if successfully added to or deleted from
hardware.
Implement the new API in flower classifier. Use it to manage hw_filters
list under cb_lock protection, instead of relying on rtnl lock to
synchronize with concurrent fl_reoffload() call.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without rtnl lock protection filters can no longer safely manage block
offloads counter themselves. Refactor cls API to protect block offloadcnt
with tcf_block->cb_lock that is already used to protect driver callback
list and nooffloaddevcnt counter. The counter can be modified by concurrent
tasks by new functions that execute block callbacks (which is safe with
previous patch that changed its type to atomic_t), however, block
bind/unbind code that checks the counter value takes cb_lock in write mode
to exclude any concurrent modifications. This approach prevents race
conditions between bind/unbind and callback execution code but allows for
concurrency for tc rule update path.
Move block offload counter, filter in hardware counter and filter flags
management from classifiers into cls hardware offloads API. Make functions
tcf_block_offload_{inc|dec}() and tc_cls_offload_cnt_update() to be cls API
private. Implement following new cls API to be used instead:
tc_setup_cb_add() - non-destructive filter add. If filter that wasn't
already in hardware is successfully offloaded, increment block offloads
counter, set filter in hardware counter and flag. On failure, previously
offloaded filter is considered to be intact and offloads counter is not
decremented.
tc_setup_cb_replace() - destructive filter replace. Release existing
filter block offload counter and reset its in hardware counter and flag.
Set new filter in hardware counter and flag. On failure, previously
offloaded filter is considered to be destroyed and offload counter is
decremented.
tc_setup_cb_destroy() - filter destroy. Unconditionally decrement block
offloads counter.
tc_setup_cb_reoffload() - reoffload filter to single cb. Execute cb() and
call tc_cls_offload_cnt_update() if cb() didn't return an error.
Refactor all offload-capable classifiers to atomically offload filters to
hardware, change block offload counter, and set filter in hardware counter
and flag by means of the new cls API functions.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As a preparation for running proto ops functions without rtnl lock, change
offload counter type to atomic. This is necessary to allow updating the
counter by multiple concurrent users when offloading filters to hardware
from unlocked classifiers.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to remove dependency on rtnl lock, extend tcf_block with 'cb_lock'
rwsem and use it to protect flow_block->cb_list and related counters from
concurrent modification. The lock is taken in read mode for read-only
traversal of cb_list in tc_setup_cb_call() and write mode in all other
cases. This approach ensures that:
- cb_list is not changed concurrently while filters is being offloaded on
block.
- block->nooffloaddevcnt is checked while holding the lock in read mode,
but is only changed by bind/unbind code when holding the cb_lock in write
mode to prevent concurrent modification.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eli Dorfman reports that after a series of idle disconnects, an
RPC/RDMA transport becomes unusable (rdma_create_qp returns
-ENOMEM). Problem was tracked down to increasing Send Queue size
after each reconnect.
The rdma_create_qp() API does not promise to leave its @qp_init_attr
parameter unaltered. In fact, some drivers do modify one or more of
its fields. Thus our calls to rdma_create_qp must use a fresh copy
of ib_qp_init_attr each time.
This fix is appropriate for kernels dating back to late 2007, though
it will have to be adapted, as the connect code has changed over the
years.
Reported-by: Eli Dorfman <eli@vastdata.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Ensure that the re-establishment delay does not grow exponentially
on each good reconnect. This probably should have been part of
commit 675dd90ad0 ("xprtrdma: Modernize ops->connect").
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
If the connection breaks while we're waiting for a reply from the
server, then we want to immediately try to reconnect.
Fixes: ec6017d903 ("SUNRPC fix regression in umount of a secure mount")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
This reverts commit a79f194aa4.
The mechanism for aborting I/O is racy, since we are not guaranteed that
the request is asleep while we're changing both task->tk_status and
task->tk_action.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v5.1
If a connect or bind attempt returns EADDRINUSE, that means we want to
retry with a different port. It is not a fatal connection error.
Similarly, ENOBUFS is not fatal, but just indicates a memory allocation
issue. Retry after a short delay.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Don't handle errors in call_bind_status()/call_connect_status()
if it turns out that a previous call caused it to succeed.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v5.1+
The optimization done in "xprtrdma: Simplify rpcrdma_mr_pop" was a
bit too optimistic. MRs left over after a reconnect still need to
be recycled, not added back to the free list, since they could be
in flight or actually fully registered.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Currently, there is no vlan information (e.g. when used with a vlan aware
bridge) passed to userspache, HWHEADER will contain an 08 00 (ip) suffix
even for tagged ip packets.
Therefore, add an extra netlink attribute that passes the vlan information
to userspace similarly to 15824ab29f for nfqueue.
Signed-off-by: Michael Braun <michael-dev@fami-braun.de>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch introduces meta matches in the kernel for time (a UNIX timestamp),
day (a day of week, represented as an integer between 0-6), and
hour (an hour in the current day, or: number of seconds since midnight).
All values are taken as unsigned 64-bit integers.
The 'time' keyword is internally converted to nanoseconds by nft in
userspace, and hence the timestamp is taken in nanoseconds as well.
Signed-off-by: Ander Juaristi <a@juaristi.eus>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Introduce new helper functions to load/store 64-bit values onto/from
registers:
- nft_reg_store64
- nft_reg_load64
This commit also re-orders all these helpers from smallest to largest
target bit size.
Signed-off-by: Ander Juaristi <a@juaristi.eus>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch addresses a conntrack cache issue with timeout policy.
Currently, we do not check if the timeout extension is set properly in the
cached conntrack entry. Thus, after packet recirculate from conntrack
action, the timeout policy is not applied properly. This patch fixes the
aforementioned issue.
Fixes: 06bd2bdf19 ("openvswitch: Add timeout support to ct action")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using mpls over gre/gre6 setup, rt->rt_gw4 address is not set, the
same for rt->rt_gw_family. Therefore, when rt->rt_gw_family is checked
in mpls_xmit(), neigh_xmit() call is skipped. As a result, such setup
doesn't work anymore.
This issue was found with LTP mpls03 tests.
Fixes: 1550c17193 ("ipv4: Prepare rtable for IPv6 gateway")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
>From IB specific 7.6.5 SERVICE LEVEL, Service Level (SL)
is used to identify different flows within an IBA subnet.
It is carried in the local route header of the packet.
Before this commit, run "rds-info -I". The outputs are as
below:
"
RDS IB Connections:
LocalAddr RemoteAddr Tos SL LocalDev RemoteDev
192.2.95.3 192.2.95.1 2 0 fe80::21:28:1a:39 fe80::21:28:10:b9
192.2.95.3 192.2.95.1 1 0 fe80::21:28:1a:39 fe80::21:28:10:b9
192.2.95.3 192.2.95.1 0 0 fe80::21:28:1a:39 fe80::21:28:10:b9
"
After this commit, the output is as below:
"
RDS IB Connections:
LocalAddr RemoteAddr Tos SL LocalDev RemoteDev
192.2.95.3 192.2.95.1 2 2 fe80::21:28:1a:39 fe80::21:28:10:b9
192.2.95.3 192.2.95.1 1 1 fe80::21:28:1a:39 fe80::21:28:10:b9
192.2.95.3 192.2.95.1 0 0 fe80::21:28:1a:39 fe80::21:28:10:b9
"
The commit fe3475af3b ("net: rds: add per rds connection cache
statistics") adds cache_allocs in struct rds_info_rdma_connection
as below:
struct rds_info_rdma_connection {
...
__u32 rdma_mr_max;
__u32 rdma_mr_size;
__u8 tos;
__u32 cache_allocs;
};
The peer struct in rds-tools of struct rds_info_rdma_connection is as
below:
struct rds_info_rdma_connection {
...
uint32_t rdma_mr_max;
uint32_t rdma_mr_size;
uint8_t tos;
uint8_t sl;
uint32_t cache_allocs;
};
The difference between userspace and kernel is the member variable sl.
In the kernel struct, the member variable sl is missing. This will
introduce risks. So it is necessary to use this commit to avoid this risk.
Fixes: fe3475af3b ("net: rds: add per rds connection cache statistics")
CC: Joe Jin <joe.jin@oracle.com>
CC: JUNXIAO_BI <junxiao.bi@oracle.com>
Suggested-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
An excerpt from netlink(7) man page,
In multipart messages (multiple nlmsghdr headers with associated payload
in one byte stream) the first and all following headers have the
NLM_F_MULTI flag set, except for the last header which has the type
NLMSG_DONE.
but, after (ee28906) there is a missing NLM_F_MULTI flag in the middle of a
FIB dump. The result is user space applications following above man page
excerpt may get confused and may stop parsing msg believing something went
wrong.
In the golang netlink lib [0] the library logic stops parsing believing the
message is not a multipart message. Found this running Cilium[1] against
net-next while adding a feature to auto-detect routes. I noticed with
multiple route tables we no longer could detect the default routes on net
tree kernels because the library logic was not returning them.
Fix this by handling the fib_dump_info_fnhe() case the same way the
fib_dump_info() handles it by passing the flags argument through the
call chain and adding a flags argument to rt_fill_info().
Tested with Cilium stack and auto-detection of routes works again. Also
annotated libs to dump netlink msgs and inspected NLM_F_MULTI and
NLMSG_DONE flags look correct after this.
Note: In inet_rtm_getroute() pass rt_fill_info() '0' for flags the same
as is done for fib_dump_info() so this looks correct to me.
[0] https://github.com/vishvananda/netlink/
[1] https://github.com/cilium/
Fixes: ee28906fd7 ("ipv4: Dump route exceptions if requested")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If protocols registered exceeded PROTO_INUSE_NR, prot will be
added to proto_list, but no available bit left for prot in
proto_inuse_idx.
Changes since v2:
* Propagate the error code properly
Signed-off-by: zhanglin <zhang.lin16@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
The consume_skb() function performs also input parameter validation.
Thus the test around the call is not needed.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In decode_session{4,6} there is a possibility that the skb dst dev is NULL,
e,g, with tunnel collect_md mode, which will cause kernel crash.
Here is what the code path looks like, for GRE:
- ip6gre_tunnel_xmit
- ip6gre_xmit_ipv6
- __gre6_xmit
- ip6_tnl_xmit
- if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
- icmpv6_send
- icmpv6_route_lookup
- xfrm_decode_session_reverse
- decode_session4
- oif = skb_dst(skb)->dev->ifindex; <-- here
- decode_session6
- oif = skb_dst(skb)->dev->ifindex; <-- here
The reason is __metadata_dst_init() init dst->dev to NULL by default.
We could not fix it in __metadata_dst_init() as there is no dev supplied.
On the other hand, the skb_dst(skb)->dev is actually not needed as we
called decode_session{4,6} via xfrm_decode_session_reverse(), so oif is not
used by: fl4->flowi4_oif = reverse ? skb->skb_iif : oif;
So make a dst dev check here should be clean and safe.
v4: No changes.
v3: No changes.
v2: fix the issue in decode_session{4,6} instead of updating shared dst dev
in {ip_md, ip6}_tunnel_xmit.
Fixes: 8d79266bc4 ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In __icmp_send() there is a possibility that the rt->dst.dev is NULL,
e,g, with tunnel collect_md mode, which will cause kernel crash.
Here is what the code path looks like, for GRE:
- ip6gre_tunnel_xmit
- ip6gre_xmit_ipv4
- __gre6_xmit
- ip6_tnl_xmit
- if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
- icmp_send
- net = dev_net(rt->dst.dev); <-- here
The reason is __metadata_dst_init() init dst->dev to NULL by default.
We could not fix it in __metadata_dst_init() as there is no dev supplied.
On the other hand, the reason we need rt->dst.dev is to get the net.
So we can just try get it from skb->dev when rt->dst.dev is NULL.
v4: Julian Anastasov remind skb->dev also could be NULL. We'd better
still use dst.dev and do a check to avoid crash.
v3: No changes.
v2: fix the issue in __icmp_send() instead of updating shared dst dev
in {ip_md, ip6}_tunnel_xmit.
Fixes: c8b34e680a ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: 06bd2bdf19 ("openvswitch: Add timeout support to ct action")
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Schmidt says:
====================
pull-request: ieee802154 for net 2019-08-24
An update from ieee802154 for your *net* tree.
Yue Haibing fixed two bugs discovered by KASAN in the hwsim driver for
ieee802154 and Colin Ian King cleaned up a redundant variable assignment.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf 2019-08-24
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix verifier precision tracking with BPF-to-BPF calls, from Alexei.
2) Fix a use-after-free in prog symbol exposure, from Daniel.
3) Several s390x JIT fixes plus BE related fixes in BPF kselftests, from Ilya.
4) Fix memory leak by unpinning XDP umem pages in error path, from Ivan.
5) Fix a potential use-after-free on flow dissector detach, from Jakub.
6) Fix bpftool to close prog fd after showing metadata, from Quentin.
7) BPF kselftest config and TEST_PROGS_EXTENDED fixes, from Anders.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
test_select_reuseport fails on s390 due to verifier rejecting
test_select_reuseport_kern.o with the following message:
; data_check.eth_protocol = reuse_md->eth_protocol;
18: (69) r1 = *(u16 *)(r6 +22)
invalid bpf_context access off=22 size=2
This is because on big-endian machines casts from __u32 to __u16 are
generated by referencing the respective variable as __u16 with an offset
of 2 (as opposed to 0 on little-endian machines).
The verifier already has all the infrastructure in place to allow such
accesses, it's just that they are not explicitly enabled for
eth_protocol field. Enable them for eth_protocol field by using
bpf_ctx_range instead of offsetof.
Ditto for ip_protocol, bind_inany and len, since they already allow
narrowing, and the same problem can arise when working with them.
Fixes: 2dbb9b9e6d ("bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Call to bpf_prog_put(), with help of call_rcu(), queues an RCU-callback to
free the program once a grace period has elapsed. The callback can run
together with new RCU readers that started after the last grace period.
New RCU readers can potentially see the "old" to-be-freed or already-freed
pointer to the program object before the RCU update-side NULLs it.
Reorder the operations so that the RCU update-side resets the protected
pointer before the end of the grace period after which the program will be
freed.
Fixes: d58e468b11 ("flow_dissector: implements flow dissector BPF hook")
Reported-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Petar Penkov <ppenkov@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Timestamps are currently communicated to user space as 'struct
timespec', which is not considered y2038 safe since it uses a 32-bit
signed value for seconds.
Fix this while the API is still not part of any official kernel release
by using 64-bit nanoseconds timestamps instead.
Fixes: ca30707dee ("drop_monitor: Add packet alert mode")
Fixes: 5e58109b1e ("drop_monitor: Add support for packet alert mode for hardware drops")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the RDMA cookie and RX timestamp to the usercopy whitelist.
After the introduction of hardened usercopy whitelisting
(https://lwn.net/Articles/727322/), a warning is displayed when the
RDMA cookie or RX timestamp is copied to userspace:
kernel: WARNING: CPU: 3 PID: 5750 at
mm/usercopy.c:81 usercopy_warn+0x8e/0xa6
[...]
kernel: Call Trace:
kernel: __check_heap_object+0xb8/0x11b
kernel: __check_object_size+0xe3/0x1bc
kernel: put_cmsg+0x95/0x115
kernel: rds_recvmsg+0x43d/0x620 [rds]
kernel: sock_recvmsg+0x43/0x4a
kernel: ___sys_recvmsg+0xda/0x1e6
kernel: ? __handle_mm_fault+0xcae/0xf79
kernel: __sys_recvmsg+0x51/0x8a
kernel: SyS_recvmsg+0x12/0x1c
kernel: do_syscall_64+0x79/0x1ae
When the whitelisting feature was introduced, the memory for the RDMA
cookie and RX timestamp in RDS was not added to the whitelist, causing
the warning above.
Signed-off-by: Dag Moxnes <dag.moxnes@oracle.com>
Tested-by: Jenny <jenny.x.xu@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, ipv6_find_idev returns NULL when ipv6_add_dev fails,
ignoring the specific error value. This results in addrconf_add_dev
returning ENOBUFS in all cases, which is unfortunate in cases such as:
# ip link add dummyX type dummy
# ip link set dummyX mtu 1200 up
# ip addr add 2000::/64 dev dummyX
RTNETLINK answers: No buffer space available
Commit a317a2f19d ("ipv6: fail early when creating netdev named all
or default") introduced error returns in ipv6_add_dev. Before that,
that function would simply return NULL for all failures.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Multiple batadv_ogm2_packet can be stored in an skbuff. The functions
batadv_v_ogm_send_to_if() uses batadv_v_ogm_aggr_packet() to check if there
is another additional batadv_ogm2_packet in the skb or not before they
continue processing the packet.
The length for such an OGM2 is BATADV_OGM2_HLEN +
batadv_ogm2_packet->tvlv_len. The check must first check that at least
BATADV_OGM2_HLEN bytes are available before it accesses tvlv_len (which is
part of the header. Otherwise it might try read outside of the currently
available skbuff to get the content of tvlv_len.
Fixes: 9323158ef9 ("batman-adv: OGMv2 - implement originators logic")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Multiple batadv_ogm_packet can be stored in an skbuff. The functions
batadv_iv_ogm_send_to_if()/batadv_iv_ogm_receive() use
batadv_iv_ogm_aggr_packet() to check if there is another additional
batadv_ogm_packet in the skb or not before they continue processing the
packet.
The length for such an OGM is BATADV_OGM_HLEN +
batadv_ogm_packet->tvlv_len. The check must first check that at least
BATADV_OGM_HLEN bytes are available before it accesses tvlv_len (which is
part of the header. Otherwise it might try read outside of the currently
available skbuff to get the content of tvlv_len.
Fixes: ef26157747 ("batman-adv: tvlv - basic infrastructure")
Reported-by: syzbot+355cab184197dbbfa384@syzkaller.appspotmail.com
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
an assert and a NULL pointer dereference) plus a small series from Luis
fixing instances of vfree() under spinlock.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl1f2fITHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi83fB/0a+TnNY8Q2aEeB9Y/0sckSpRCsMGMV
syt2krwKC0EYM1f2dkJdgCjlSjMzMcHPseP3g5odRXgyPKJt5O9oE7l3vGDC4Oyt
chqhEh86UzG6Kcptx6tIzsAGYS9S4NzxR5sfXF6oRu8m1bwk1n5IhKxYjQDTvAMd
RxwvpdguNA9xvHeUvLMTpy2R3qE3uQ2dxierutW67GeyeCPkvyBmazzi72Q36hlL
y1w8DWaPBemBk5QEM9vmz5i2xQeLO4h4ejhP4LcXyVjJtfvAPl0JWOsHMK4uWRJf
6XjbGDaGYvID0hTQLlEw/k73976HmRxSbaXRtCZN+IG3yWGTL8ID6GqI
=kaFB
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.3-rc6' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"Three important fixes tagged for stable (an indefinite hang, a crash
on an assert and a NULL pointer dereference) plus a small series from
Luis fixing instances of vfree() under spinlock"
* tag 'ceph-for-5.3-rc6' of git://github.com/ceph/ceph-client:
libceph: fix PG split vs OSD (re)connect race
ceph: don't try fill file_lock on unsuccessful GETFILELOCK reply
ceph: clear page dirty before invalidate page
ceph: fix buffer free while holding i_ceph_lock in fill_inode()
ceph: fix buffer free while holding i_ceph_lock in __ceph_build_xattrs_blob()
ceph: fix buffer free while holding i_ceph_lock in __ceph_setxattr()
libceph: allow ceph_buffer_put() to receive a NULL ceph_buffer
Update response packet length for the following commands per NC-SI spec
- Get Controller Packet Statistics
- Get NC-SI Statistics
- Get NC-SI Pass-through Statistics command
Signed-off-by: Ben Wei <benwei@fb.com>
Reviewed-by: Justin Lee <justin.lee1@dell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The request coming from Netlink should use the OEM generic handler.
The standard command handler expects payload in bytes/words/dwords
but the actual payload is stored in data if the request is coming from Netlink.
Signed-off-by: Justin Lee <justin.lee1@dell.com>
Reviewed-by: Vijay Khemka <vijaykhemka@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add NL80211_CMD_UPDATE_FT_IES to supported commands. In mac80211 drivers,
this can be implemented via existing NL80211_CMD_AUTHENTICATE and
NL80211_ATTR_IE, but non-mac80211 drivers have a separate command for
this. A driver supports FT if it either is mac80211 or supports this
command.
Signed-off-by: Matthew Wang <matthewmwang@chromium.org>
Reviewed-by: Brian Norris <briannorris@chromium.org>
Link: https://lore.kernel.org/r/20190822174806.2954-1-matthewmwang@chromium.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently the for-loop will spin forever if variable supported is
non-zero because supported is never changed. Fix this by adding in
the missing right shift of supported.
Addresses-Coverity: ("Infinite loop")
Fixes: 48cb39522a ("mac80211: minstrel_ht: improve rate probing for devices with static fallback")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20190822122034.28664-1-colin.king@canonical.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Variable err is initialized to a value that is never read and it is
re-assigned later. The initialization is redundant and can be removed.
Addresses-Coverity: ("Unused Value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull RCU and LKMM changes from Paul E. McKenney:
- A few more RCU flavor consolidation cleanups.
- Miscellaneous fixes.
- Updates to RCU's list-traversal macros improving lockdep usability.
- Torture-test updates.
- Forward-progress improvements for no-CBs CPUs: Avoid ignoring
incoming callbacks during grace-period waits.
- Forward-progress improvements for no-CBs CPUs: Use ->cblist
structure to take advantage of others' grace periods.
- Also added a small commit that avoids needlessly inflicting
scheduler-clock ticks on callback-offloaded CPUs.
- Forward-progress improvements for no-CBs CPUs: Reduce contention
on ->nocb_lock guarding ->cblist.
- Forward-progress improvements for no-CBs CPUs: Add ->nocb_bypass
list to further reduce contention on ->nocb_lock guarding ->cblist.
- LKMM updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We can't rely on ->peer_features in calc_target() because it may be
called both when the OSD session is established and open and when it's
not. ->peer_features is not valid unless the OSD session is open. If
this happens on a PG split (pg_num increase), that could mean we don't
resend a request that should have been resent, hanging the client
indefinitely.
In userspace this was fixed by looking at require_osd_release and
get_xinfo[osd].features fields of the osdmap. However these fields
belong to the OSD section of the osdmap, which the kernel doesn't
decode (only the client section is decoded).
Instead, let's drop this feature check. It effectively checks for
luminous, so only pre-luminous OSDs would be affected in that on a PG
split the kernel might resend a request that should not have been
resent. Duplicates can occur in other scenarios, so both sides should
already be prepared for them: see dup/replay logic on the OSD side and
retry_attempt check on the client side.
Cc: stable@vger.kernel.org
Fixes: 7de030d6b1 ("libceph: resend on PG splits if OSD has RESEND_ON_SPLIT")
Link: https://tracker.ceph.com/issues/41162
Reported-by: Jerry Lee <leisurelysw24@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Jerry Lee <leisurelysw24@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
it expects a unsigned int, but got a __be32
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 93a714d6b5 ("multicast: Extend ip address command to enable
multicast group join/leave on") we added a new flag IFA_F_MCAUTOJOIN
to make user able to add multicast address on ethernet interface.
This works for IPv4, but not for IPv6. See the inet6_addr_add code.
static int inet6_addr_add()
{
...
if (cfg->ifa_flags & IFA_F_MCAUTOJOIN) {
ipv6_mc_config(net->ipv6.mc_autojoin_sk, true...)
}
ifp = ipv6_add_addr(idev, cfg, true, extack); <- always fail with maddr
if (!IS_ERR(ifp)) {
...
} else if (cfg->ifa_flags & IFA_F_MCAUTOJOIN) {
ipv6_mc_config(net->ipv6.mc_autojoin_sk, false...)
}
}
But in ipv6_add_addr() it will check the address type and reject multicast
address directly. So this feature is never worked for IPv6.
We should not remove the multicast address check totally in ipv6_add_addr(),
but could accept multicast address only when IFA_F_MCAUTOJOIN flag supplied.
v2: update commit description
Fixes: 93a714d6b5 ("multicast: Extend ip address command to enable multicast group join/leave on")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- fix uninit-value in batadv_netlink_get_ifindex(), by Eric Dumazet
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl1dRtEWHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoeN6D/9IExhEX0OZQFf6fkHJ71yKAN0j
FjBBmumtu33brjFCtieix+jBRGny9qAO80I2q3X2yRQlx/SpHVuXUyHPrN+azNMJ
w2mb7Mw1l3+Y4KbdhM1dz+CTafmdJzgZoFsKIAamD+IP5uIUg1sSL4mLcgka35ol
GFZa82LE7hjTY1sR0azKCbbeAIlwMkoLWy5YsUmIDzJZ6eWJSEt/kEuNIWuhIhl4
VZeT6VH96xWSY1rqwgIiv0fXBuiE/n3//nMY5WquM0yFA1XJBG9jRFhmNcca8oPi
ehodBtZBlHyBXnAnDkSEJY7nux3+jsj4N4X+LDjmv6yWH5EmqMvM3NysXalM6sBj
SZeZyoxJZxl+WE658ZPoVawg3NuABOdOuY6tw/FLvFtBH3rKQzvlz0nXp19m1kef
TrGjCsJsVO/pnrXlQMB0+JnQooRId/UGUTHfgqqP2hySn6a2KcGrxCd7tWxBQ5oW
a3DSmVo7QEwxvEPnIPpxPFv698Rkheb+gQZVrOBjmE4YmdmuaCFUDt04zDRwFjsz
dH+tX3kgQvHZEwVqWylikNqkFDWuako7+N1fZ9DxWZh8tnDe6FXXgQ+DL80nuATS
DGI7ojpAhe8YUyCqOHsCD3lW1hHUmtmzj/N2OrN+U2ROO745RCqs9Co+Kn2ugRhq
YOdgbYJL6aRzu/D52w==
=9t+O
-----END PGP SIGNATURE-----
Merge tag 'batadv-net-for-davem-20190821' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here is a batman-adv bugfix:
- fix uninit-value in batadv_netlink_get_ifindex(), by Eric Dumazet
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Micro-optimization: In rpcrdma_post_recvs, since commit e340c2d6ef
("xprtrdma: Reduce the doorbell rate (Receive)"), the common case is
to return without doing anything. Found with perf.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Micro-optimization: Save the cost of three function calls during
transport header encoding.
These were "noinline" before to generate more meaningful call stacks
during debugging, but this code is now pretty stable.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
For the moment the returned value just happens to be correct because
the current backchannel server implementation does not vary the
number of credits it offers. The spec does permit this value to
change during the lifetime of a connection, however.
The actual maximum is fixed for all RPC/RDMA transports, because
each transport instance has to pre-allocate the resources for
processing BC requests. That's the value that should be returned.
Fixes: 7402a4fedc ("SUNRPC: Fix up backchannel slot table ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Jason Gunthorpe says:
====================
This is a collection of general cleanups for ODP to clarify some of the
flows around umem creation and use of the interval tree.
====================
The branch is based on v5.3-rc5 due to dependencies
* odp_fixes:
RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
RDMA/mlx5: Use ib_umem_start instead of umem.address
RDMA/core: Make invalidate_range a device operation
RDMA/odp: Use kvcalloc for the dma_list and page_list
RDMA/odp: Check for overflow when computing the umem_odp end
RDMA/odp: Provide ib_umem_odp_release() to undo the allocs
RDMA/odp: Split creating a umem_odp from ib_umem_get
RDMA/odp: Make the three ways to create a umem_odp clear
RMDA/odp: Consolidate umem_odp initialization
RDMA/odp: Make it clearer when a umem is an implicit ODP umem
RDMA/odp: Iterate over the whole rbtree directly
RDMA/odp: Use the common interval tree library instead of generic
RDMA/mlx5: Fix MR npages calculation for IB_ACCESS_HUGETLB
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Clean up: The function name should match the documenting comment.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
rpcrdma_rep objects are removed from their free list by only a
single thread: the Receive completion handler. Thus that free list
can be converted to an llist, where a single-threaded consumer and
a multi-threaded producer (rpcrdma_buffer_put) can both access the
llist without the need for any serialization.
This eliminates spin lock contention between the Receive completion
handler and rpcrdma_buffer_get, and makes the rep consumer wait-
free.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: Now that the free list is used sparingly, get rid of the
separate spin lock protecting it.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Instead of a globally-contended MR free list, cache MRs in each
rpcrdma_req as they are released. This means acquiring and releasing
an MR will be lock-free in the common case, even outside the
transport send lock.
The original idea of per-rpcrdma_req MR free lists was suggested by
Shirley Ma <shirley.ma@oracle.com> several years ago. I just now
figured out how to make that idea work with on-demand MR allocation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
For 64-bit there is no reason to use vmap/vunmap, so use page_address
as it was initially. For 32 bits, in some apps, like in samples
xdpsock_user.c when number of pgs in use is quite big, the kmap
memory can be not enough, despite on this, kmap looks like is
deprecated in such cases as it can block and should be used rather
for dynamic mm.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
On some devices that only support static rate fallback tables sending rate
control probing packets can be really expensive.
Probing lower rates can already hurt throughput quite a bit. What hurts even
more is the fact that on mt76x0/mt76x2, single probing packets can only be
forced by directing packets at a different internal hardware queue, which
causes some heavy reordering and extra latency.
The reordering issue is mainly problematic while pushing lots of packets to
a particular station. If there is little activity, the overhead of probing is
neglegible.
The static fallback behavior is designed to pretty much only handle rate
control algorithms that use only a very limited set of rates on which the
algorithm switches up/down based on packet error rate.
In order to better support that kind of hardware, this patch implements a
different approach to rate probing where it switches to a slightly higher rate,
waits for tx status feedback, then updates the stats and switches back to
the new max throughput rate. This only triggers above a packet rate of 100
per stats interval (~50ms).
For that kind of probing, the code has to reduce the set of probing rates
a lot more compared to single packet probing, so it uses only one packet
per MCS group which is either slightly faster, or as close as possible to
the max throughput rate.
This allows switching between similar rates with different numbers of
streams. The algorithm assumes that the hardware will work its way lower
within an MCS group in case of retransmissions, so that lower rates don't
have to be probed by the high packets per second rate probing code.
To further reduce the search space, it also does not probe rates with lower
channel bandwidth than the max throughput rate.
At the moment, these changes will only affect mt76x0/mt76x2.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20190820095449.45255-4-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
On hardware with static fallback tables (e.g. mt76x2), rate probing attempts
can be very expensive.
On such devices, avoid sampling rates slower than the per-group max throughput
rate, based on the assumption that the fallback table will take care of probing
lower rates within that group if the higher rates fail.
To further reduce unnecessary probing attempts, skip duplicate attempts on
rates slower than the max throughput rate.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20190820095449.45255-2-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The group number needs to be multiplied by the number of rates per group
to get the full rate index
Fixes: 5935839ad7 ("mac80211: improve minstrel_ht rate sorting by throughput & probability")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20190820095449.45255-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
802.11ay specification defines Enhanced Directional Multi-Gigabit
(EDMG) STA and AP which allow channel bonding of 2 channels and more.
Introduce new NL attributes that are needed for enabling and
configuring EDMG support.
Two new attributes are used by kernel to publish driver's EDMG
capabilities to the userspace:
NL80211_BAND_ATTR_EDMG_CHANNELS - bitmap field that indicates the 2.16
GHz channel(s) that are supported by the driver.
When this attribute is not set it means driver does not support EDMG.
NL80211_BAND_ATTR_EDMG_BW_CONFIG - represent the channel bandwidth
configurations supported by the driver.
Additional two new attributes are used by the userspace for connect
command and for AP configuration:
NL80211_ATTR_WIPHY_EDMG_CHANNELS
NL80211_ATTR_WIPHY_EDMG_BW_CONFIG
New rate info flag - RATE_INFO_FLAGS_EDMG, can be reported from driver
and used for bitrate calculation that will take into account EDMG
according to the 802.11ay specification.
Signed-off-by: Alexei Avshalom Lazar <ailizaro@codeaurora.org>
Link: https://lore.kernel.org/r/1566138918-3823-2-git-send-email-ailizaro@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
he_spr_ie_elem is dereferenced before the NULL check. fix this by moving
the assignment after the check.
fixes commit 697f6c507c ("mac80211: propagate HE operation info into
bss_conf")
This was reported by the static code checker.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20190813070712.25509-1-john@phrozen.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Report timestamp for when sta becomes associated.
Signed-off-by: Ben Greear <greearb@candelatech.com>
Link: https://lore.kernel.org/r/20190809180001.26393-2-greearb@candelatech.com
[fix ktime_get_boot_ns() to ktime_get_boottime_ns(), assoc_at type to u64]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
For the new 6GHz band the same rules apply for mandatory rates so
add it to set_mandatory_flags_band() function.
Reviewed-by: Pieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com>
Reviewed-by: Leon Zegers <leon.zegers@broadcom.com>
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Link: https://lore.kernel.org/r/1564745465-21234-9-git-send-email-arend.vanspriel@broadcom.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The default mandatory rates, ie. when not specified by user-space, is
determined by the band. Select 11a rateset for 6GHz band.
Reviewed-by: Pieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com>
Reviewed-by: Leon Zegers <leon.zegers@broadcom.com>
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Link: https://lore.kernel.org/r/1564745465-21234-8-git-send-email-arend.vanspriel@broadcom.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The function cfg80211_ir_permissive_chan() is applicable for
6GHz band as well so make sure it is handled.
Reviewed-by: Pieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com>
Reviewed-by: Leon Zegers <leon.zegers@broadcom.com>
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Link: https://lore.kernel.org/r/1564745465-21234-7-git-send-email-arend.vanspriel@broadcom.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In nl80211.c there is a policy for all bands in NUM_NL80211_BANDS and
in trace.h there is a callback trace for multicast rates which is per
band in NUM_NL80211_BANDS. Both need to be extended for the new
NL80211_BAND_6GHZ.
Reviewed-by: Pieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com>
Reviewed-by: Leon Zegers <leon.zegers@broadcom.com>
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Link: https://lore.kernel.org/r/1564745465-21234-6-git-send-email-arend.vanspriel@broadcom.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add 6GHz operating class range as defined in 802.11ax D4.1 Annex E.
Reviewed-by: Pieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com>
Reviewed-by: Leon Zegers <leon.zegers@broadcom.com>
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Link: https://lore.kernel.org/r/1564745465-21234-5-git-send-email-arend.vanspriel@broadcom.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Extend the functions ieee80211_channel_to_frequency() and
ieee80211_frequency_to_channel() to support 6GHz band according
specification in 802.11ax D4.1 27.3.22.2.
Reviewed-by: Pieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com>
Reviewed-by: Leon Zegers <leon.zegers@broadcom.com>
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Link: https://lore.kernel.org/r/1564745465-21234-4-git-send-email-arend.vanspriel@broadcom.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In the 802.11ax specification a new band is introduced, which
is also proposed by FCC for unlicensed use. This band is referred
to as 6GHz spanning frequency range from 5925 to 7125 MHz.
Reviewed-by: Pieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com>
Reviewed-by: Leon Zegers <leon.zegers@broadcom.com>
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Link: https://lore.kernel.org/r/1564745465-21234-2-git-send-email-arend.vanspriel@broadcom.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This reverts commit 96cce12ff6 ("cfg80211: fix processing world
regdomain when non modular").
Re-triggering a reg_process_hint with the last request on all events,
can make the regulatory domain fail in case of multiple WiFi modules. On
slower boards (espacially with mdev), enumeration of the WiFi modules
can end up in an intersected regulatory domain, and user cannot set it
with 'iw reg set' anymore.
This is happening, because:
- 1st module enumerates, queues up a regulatory request
- request gets processed by __reg_process_hint_driver():
- checks if previous was set by CORE -> yes
- checks if regulator domain changed -> yes, from '00' to e.g. 'US'
-> sends request to the 'crda'
- 2nd module enumerates, queues up a regulator request (which triggers
the reg_todo() work)
- reg_todo() -> reg_process_pending_hints() sees, that the last request
is not processed yet, so it tries to process it again.
__reg_process_hint driver() will run again, and:
- checks if the last request's initiator was the core -> no, it was
the driver (1st WiFi module)
- checks, if the previous initiator was the driver -> yes
- checks if the regulator domain changed -> yes, it was '00' (set by
core, and crda call did not return yet), and should be changed to 'US'
------> __reg_process_hint_driver calls an intersect
Besides, the reg_process_hint call with the last request is meaningless
since the crda call has a timeout work. If that timeout expires, the
first module's request will lost.
Cc: stable@vger.kernel.org
Fixes: 96cce12ff6 ("cfg80211: fix processing world regdomain when non modular")
Signed-off-by: Robert Hodaszi <robert.hodaszi@digi.com>
Link: https://lore.kernel.org/r/20190614131600.GA13897@a1-hr
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The code generating the Tx Radiotap header when using tx_status_ext was
missing a field increment after setting the VHT bandwidth.
Fixes: 3d07ffcaf3 ("mac80211: add struct ieee80211_tx_status support to ieee80211_add_tx_radiotap_header")
Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20190807075949.32414-4-john@phrozen.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When reporting 80MHz, we need to set 4 and not 2 inside the corresponding
field inside the Tx Radiotap header.
Fixes: 3d07ffcaf3 ("mac80211: add struct ieee80211_tx_status support to ieee80211_add_tx_radiotap_header")
Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20190807075949.32414-3-john@phrozen.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When reporting legacy rates inside the TX Radiotap header we need to split
the check between "uses tx_statua_ext" and "is legacy rate". Not doing so
would make the code drop into the !tx_status_ext path.
Fixes: 3d07ffcaf3 ("mac80211: add struct ieee80211_tx_status support to ieee80211_add_tx_radiotap_header")
Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20190807075949.32414-2-john@phrozen.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The RX Radiotap header length was not calculated properly when reporting
legacy rates using tx_status_ext.
Fixes: 3d07ffcaf3 ("mac80211: add struct ieee80211_tx_status support to ieee80211_add_tx_radiotap_header")
Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20190807075949.32414-1-john@phrozen.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Fix two shortcomings in the Extended Key ID API:
1) Allow the userspace to install pairwise keys using keyid 1 without
NL80211_KEY_NO_TX set. This allows the userspace to install and
activate pairwise keys with keyid 1 in the same way as for keyid 0,
simplifying the API usage for e.g. FILS and FT key installs.
2) IEEE 802.11 - 2016 restricts Extended Key ID usage to CCMP/GCMP
ciphers in IEEE 802.11 - 2016 "9.4.2.25.4 RSN capabilities".
Enforce that when installing a key.
Cc: stable@vger.kernel.org # 5.2
Fixes: 6cdd3979a2 ("nl80211/cfg80211: Extended Key ID support")
Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
Link: https://lore.kernel.org/r/20190805123400.51567-1-alexander@wetzel-home.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Probably would be good to also pass GFP flags to ib_alloc_mr.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Refactor: Retrieve an MR and handle error recovery entirely in
rpc_rdma.c, as this is not a device-specific function.
Note that since commit 89f90fe1ad ("SUNRPC: Allow calls to
xprt_transmit() to drain the entire transmit queue"), the
xprt_transmit function handles the cond_resched. The transport no
longer has to do this itself.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up. There is only one remaining rpcrdma_mr_put call site, and
it can be directly replaced with unmap_and_put because mr->mr_dir is
set to DMA_NONE just before the call.
Now all the call sites do a DMA unmap, and we can just rename
mr_unmap_and_put to mr_put, which nicely matches mr_get.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: rpcrdma_mr_pop call sites check if the list is empty
first. Let's replace the list_empty with less costly logic.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
in ip_mc_inc_group, memory allocation flag, not mcast mode, is expected
by __ip_mc_inc_group
similar issue in __ip_mc_join_group, both mcase mode and gfp_t are needed
here, so use ____ip_mc_inc_group(...)
Fixes: 9fb20801da ("net: Fix ip_mc_{dec,inc}_group allocation context")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The NCSI spec indicates that if the data does not end on a 32 bit
boundary, one to three padding bytes equal to 0x00 shall be present to
align the checksum field to a 32-bit boundary.
Signed-off-by: Terry S. Duncan <terry.s.duncan@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Call the .port_enable and .port_disable functions for all ports,
not only the user ports, so that drivers may optimize the power
consumption of all ports after a successful setup.
Unused ports are now disabled on setup. CPU and DSA ports are now
enabled on setup and disabled on teardown. User ports were already
enabled at slave creation and disabled at slave destruction.
Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is currently difficult to read the different steps involved in the
setup and teardown of ports in the DSA code. Keep it simple with a
single switch statement for each port type: UNUSED, CPU, DSA, or USER.
Also no need to call devlink_port_unregister from within dsa_port_setup
as this step is inconditionally handled by dsa_port_teardown on error.
Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, we are only explicitly setting SOCK_NOSPACE on a write timeout
for non-blocking sockets. Epoll() edge-trigger mode relies on SOCK_NOSPACE
being set when -EAGAIN is returned to ensure that EPOLLOUT is raised.
Expand the setting of SOCK_NOSPACE to non-blocking sockets as well that can
use SO_SNDTIMEO to adjust their write timeout. This mirrors the behavior
that Eric Dumazet introduced for tcp sockets.
Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Ursula Braun <ubraun@linux.ibm.com>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 48be539dd4 ("xprtrdma: Introduce ->alloc_slot call-out for
xprtrdma") added a separate alloc_slot and free_slot to the RPC/RDMA
transport. Later, commit 75891f502f ("SUNRPC: Support for
congestion control when queuing is enabled") modified the generic
alloc/free_slot methods, but neglected the methods in xprtrdma.
Found via code review.
Fixes: 75891f502f ("SUNRPC: Support for congestion control ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: There are other "all" list heads. For code clarity
distinguish this one as for use only for MRs by renaming it.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Make the field name the same for all trace points that handle
pointers to struct rpcrdma_rep. That makes it easy to grep for
matching rep points in trace output.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Although I haven't seen any performance results that justify it,
I've received several complaints that NFS/RDMA no longer supports
a maximum rsize and wsize of 1MB. These days it is somewhat smaller.
To simplify the logic that determines whether a chunk list is
necessary, the implementation uses a fixed maximum size of the
transport header. Currently that maximum size is 256 bytes, one
quarter of the default inline threshold size for RPC/RDMA v1.
Since commit a78868497c ("xprtrdma: Reduce max_frwr_depth"), the
size of chunks is also smaller to take advantage of inline page
lists in device internal MR data structures.
The combination of these two design choices has reduced the maximum
NFS rsize and wsize that can be used for most RNIC/HCAs. Increasing
the maximum transport header size and the maximum number of RDMA
segments it can contain increases the negotiated maximum rsize/wsize
on common RNIC/HCAs.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Commit 302d3deb20 ("xprtrdma: Prevent inline overflow") added this
calculation back in 2016, but got it wrong. I tested only the lower
bound, which is why there is a max_t there. The upper bound should be
rounded up too.
Now, when using DIV_ROUND_UP, that takes care of the lower bound as
well.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Comment was made obsolete by commit 8cec3dba76 ("xprtrdma:
rpcrdma_regbuf alignment").
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Things have changed since this comment was written. In particular,
the reworking of connection closing, on-demand creation of MRs, and
the removal of fr_state all mean that deferring MR recovery to
frwr_map is no longer needed. The description is obsolete.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Fix mem leak caused by missed unpin routine for umem pages.
Fixes: 8aef7340ae ("xsk: introduce xdp_umem_page")
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Micro-optimization: For xdr_commit_encode call sites in
net/sunrpc/xdr.c, eliminate the extra calling sequence. On my
client, this change saves about a microsecond for every 30 calls
to xdr_reserve_space().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Clean up: commit c544577dad ("SUNRPC: Clean up transport write
space handling") appears to have removed the last caller of
rpc_wake_up_queued_task_on_wq().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
syzbot reported a splat:
xfrm_policy_inexact_list_reinsert+0x625/0x6e0 net/xfrm/xfrm_policy.c:877
CPU: 1 PID: 6756 Comm: syz-executor.1 Not tainted 5.3.0-rc2+ #57
Call Trace:
xfrm_policy_inexact_node_reinsert net/xfrm/xfrm_policy.c:922 [inline]
xfrm_policy_inexact_node_merge net/xfrm/xfrm_policy.c:958 [inline]
xfrm_policy_inexact_insert_node+0x537/0xb50 net/xfrm/xfrm_policy.c:1023
xfrm_policy_inexact_alloc_chain+0x62b/0xbd0 net/xfrm/xfrm_policy.c:1139
xfrm_policy_inexact_insert+0xe8/0x1540 net/xfrm/xfrm_policy.c:1182
xfrm_policy_insert+0xdf/0xce0 net/xfrm/xfrm_policy.c:1574
xfrm_add_policy+0x4cf/0x9b0 net/xfrm/xfrm_user.c:1670
xfrm_user_rcv_msg+0x46b/0x720 net/xfrm/xfrm_user.c:2676
netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2477
xfrm_netlink_rcv+0x74/0x90 net/xfrm/xfrm_user.c:2684
netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
netlink_unicast+0x809/0x9a0 net/netlink/af_netlink.c:1328
netlink_sendmsg+0xa70/0xd30 net/netlink/af_netlink.c:1917
sock_sendmsg_nosec net/socket.c:637 [inline]
sock_sendmsg net/socket.c:657 [inline]
There is no reproducer, however, the warning can be reproduced
by adding rules with ever smaller prefixes.
The sanity check ("does the policy match the node") uses the prefix value
of the node before its updated to the smaller value.
To fix this, update the prefix earlier. The bug has no impact on tree
correctness, this is only to prevent a false warning.
Reported-by: syzbot+8cc27ace5f6972910b31@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The netns sctp feature flags shouldn't work as a global switch,
which is mostly like a firewall/netfilter's job. Also, it will
break asoc as it discard or accept chunks incorrectly when net
sctp.x_enable is changed after the asoc is created.
Since each type of chunk's processing function will check the
corresp asoc's feature flag, this 'global switch' should be
removed, and net sctp.x_enable will only work as the default
feature flags for the future sctp sockets/endpoints.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SCTP_AUTH_SUPPORTED sockopt is used to set enpoint's auth
flag. With this feature, each endpoint will have its own
flag for its future asoc's auth_capable, instead of netns
auth flag.
Note that when both ep's auth_enable is enabled, endpoint
auth related data should be initialized. If asconf_enable
is also set, SCTP_CID_ASCONF/SCTP_CID_ASCONF_ACK should
be added into auth_chunk_list.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to factor out sctp_auth_init and sctp_auth_free
functions, and sctp_auth_init will also be used in the next
patch for SCTP_AUTH_SUPPORTED sockopt.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp has per endpoint auth flag and per asoc auth flag, and
the asoc one should be checked when coming to asoc and the
endpoint one should be checked when coming to endpoint.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SCTP_ASCONF_SUPPORTED sockopt is used to set enpoint's asconf
flag. With this feature, each endpoint will have its own flag
for its future asoc's asconf_capable, instead of netns asconf
flag.
Note that when both ep's asconf_enable and auth_enable are
enabled, SCTP_CID_ASCONF and SCTP_CID_ASCONF_ACK should be
added into auth_chunk_list.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
asconf chunks should be dropped when the asoc doesn't support
asconf feature.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
asoc->peer.asconf_capable is to be set during handshake, and its
value should be initialized to 0. net->sctp.addip_noauth will be
checked in sctp_process_init when processing INIT_ACK on client
and COOKIE_ECHO on server.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is to make addip/asconf flag per endpoint,
and its value is initialized by the per netns flag,
net->sctp.addip_enable.
It also replaces the checks of net->sctp.addip_enable
with ep->asconf_enable in some places.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pointer members of an object with static storage duration, if not
explicitly initialized, will be initialized to a NULL pointer. The
net namespace API checks if this pointer is not NULL before using it,
it are safe to remove the function.
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Remove IP MASQUERADING record in MAINTAINERS file,
from Denis Efremov.
2) Counter arguments are swapped in ebtables, from
Todd Seidelmann.
3) Missing netlink attribute validation in flow_offload
extension.
4) Incorrect alignment in xt_nfacct that breaks 32-bits
userspace / 64-bits kernels, from Juliana Rodrigueiro.
5) Missing include guard in nf_conntrack_h323_types.h,
from Masahiro Yamada.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
As Jason Baron explained in commit 790ba4566c ("tcp: set SOCK_NOSPACE
under memory pressure"), it is crucial we properly set SOCK_NOSPACE
when needed.
However, Jason patch had a bug, because the 'nonblocking' status
as far as sk_stream_wait_memory() is concerned is governed
by MSG_DONTWAIT flag passed at sendmsg() time :
long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
So it is very possible that tcp sendmsg() calls sk_stream_wait_memory(),
and that sk_stream_wait_memory() returns -EAGAIN with SOCK_NOSPACE
cleared, if sk->sk_sndtimeo has been set to a small (but not zero)
value.
This patch removes the 'noblock' variable since we must always
set SOCK_NOSPACE if -EAGAIN is returned.
It also renames the do_nonblock label since we might reach this
code path even if we were in blocking mode.
Fixes: 790ba4566c ("tcp: set SOCK_NOSPACE under memory pressure")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Baron <jbaron@akamai.com>
Reported-by: Vladimir Rutsky <rutsky@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the exports table is changed, exportfs will usually write a new
time to the "flush" file in the nfsd.export cache procfile. This tells
the kernel to flush any entries that are older than that value.
This gives us a mechanism to tell whether an unexport might have
occurred. Add a new ->flush cache_detail operation that is called after
flushing the cache whenever someone writes to a "flush" file.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Use a wait-free mechanism for managing the svc_rdma_recv_ctxts free
list. Subsequently, sc_recv_lock can be eliminated.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Clean up: the system workqueue will work just as well.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When running a 64-bit kernel with a 32-bit iptables binary, the size of
the xt_nfacct_match_info struct diverges.
kernel: sizeof(struct xt_nfacct_match_info) : 40
iptables: sizeof(struct xt_nfacct_match_info)) : 36
Trying to append nfacct related rules results in an unhelpful message.
Although it is suggested to look for more information in dmesg, nothing
can be found there.
# iptables -A <chain> -m nfacct --nfacct-name <acct-object>
iptables: Invalid argument. Run `dmesg' for more information.
This patch fixes the memory misalignment by enforcing 8-byte alignment
within the struct's first revision. This solution is often used in many
other uapi netfilter headers.
Signed-off-by: Juliana Rodrigueiro <juliana.rodrigueiro@intra2net.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The netlink attribute policy for NFTA_FLOW_TABLE_NAME is missing.
Fixes: a3c90f7a23 ("netfilter: nf_tables: flow offload expression")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The ordering of arguments to the x_tables ADD_COUNTER macro
appears to be wrong in ebtables (cf. ip_tables.c, ip6_tables.c,
and arp_tables.c).
This causes data corruption in the ebtables userspace tools
because they get incorrect packet & byte counts from the kernel.
Fixes: d72133e628 ("netfilter: ebtables: use ADD_COUNTER macro")
Signed-off-by: Todd Seidelmann <tseidelmann@linode.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds initial support for offloading basechains using the
priority range from 1 to 65535. This is restricting the netfilter
priority range to 16-bit integer since this is what most drivers assume
so far from tc. It should be possible to extend this range of supported
priorities later on once drivers are updated to support for 32-bit
integer priorities.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The policy for handling the skb list locks on the send and receive paths
is simple.
- On the send path we never need to grab the lock on the 'xmitq' list
when the destination is an exernal node.
- On the receive path we always need to grab the lock on the 'inputq'
list, irrespective of source node.
However, when transmitting node local messages those will eventually
end up on the receive path of a local socket, meaning that the argument
'xmitq' in tipc_node_xmit() will become the 'ínputq' argument in the
function tipc_sk_rcv(). This has been handled by always initializing
the spinlock of the 'xmitq' list at message creation, just in case it
may end up on the receive path later, and despite knowing that the lock
in most cases never will be used.
This approach is inaccurate and confusing, and has also concealed the
fact that the stated 'no lock grabbing' policy for the send path is
violated in some cases.
We now clean up this by never initializing the lock at message creation,
instead doing this at the moment we find that the message actually will
enter the receive path. At the same time we fix the four locations
where we incorrectly access the spinlock on the send/error path.
This patch also reverts commit d12cffe932 ("tipc: ensure head->lock
is initialised") which has now become redundant.
CC: Eric Dumazet <edumazet@google.com>
Reported-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an AF_XDP socket is released/closed the XSKMAP still holds a
reference to the socket in a "released" state. The socket will still
use the netdev queue resource, and block newly created sockets from
attaching to that queue, but no user application can access the
fill/complete/rx/tx queues. This results in that all applications need
to explicitly clear the map entry from the old "zombie state"
socket. This should be done automatically.
In this patch, the sockets tracks, and have a reference to, which maps
it resides in. When the socket is released, it will remove itself from
all maps.
Suggested-by: Bruce Richardson <bruce.richardson@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add new helper bpf_sk_storage_clone which optionally clones sk storage
and call it from sk_clone_lock.
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Don't uninstall an XDP program when none is installed, and don't install
an XDP program that has the same ID as the one already installed.
dev_change_xdp_fd doesn't perform any checks in case it uninstalls an
XDP program. It means that the driver's ndo_bpf can be called with
XDP_SETUP_PROG asking to set it to NULL even if it's already NULL. This
case happens if the user runs `ip link set eth0 xdp off` when there is
no XDP program attached.
The symmetrical case is possible when the user tries to set the program
that is already set.
The drivers typically perform some heavy operations on XDP_SETUP_PROG,
so they all have to handle these cases internally to return early if
they happen. This patch puts this check into the kernel code, so that
all drivers will benefit from it.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This commit adds support for a new flag called need_wakeup in the
AF_XDP Tx and fill rings. When this flag is set, it means that the
application has to explicitly wake up the kernel Rx (for the bit in
the fill ring) or kernel Tx (for bit in the Tx ring) processing by
issuing a syscall. Poll() can wake up both depending on the flags
submitted and sendto() will wake up tx processing only.
The main reason for introducing this new flag is to be able to
efficiently support the case when application and driver is executing
on the same core. Previously, the driver was just busy-spinning on the
fill ring if it ran out of buffers in the HW and there were none on
the fill ring. This approach works when the application is running on
another core as it can replenish the fill ring while the driver is
busy-spinning. Though, this is a lousy approach if both of them are
running on the same core as the probability of the fill ring getting
more entries when the driver is busy-spinning is zero. With this new
feature the driver now sets the need_wakeup flag and returns to the
application. The application can then replenish the fill queue and
then explicitly wake up the Rx processing in the kernel using the
syscall poll(). For Tx, the flag is only set to one if the driver has
no outstanding Tx completion interrupts. If it has some, the flag is
zero as it will be woken up by a completion interrupt anyway.
As a nice side effect, this new flag also improves the performance of
the case where application and driver are running on two different
cores as it reduces the number of syscalls to the kernel. The kernel
tells user space if it needs to be woken up by a syscall, and this
eliminates many of the syscalls.
This flag needs some simple driver support. If the driver does not
support this, the Rx flag is always zero and the Tx flag is always
one. This makes any application relying on this feature default to the
old behaviour of not requiring any syscalls in the Rx path and always
having to call sendto() in the Tx path.
For backwards compatibility reasons, this feature has to be explicitly
turned on using a new bind flag (XDP_USE_NEED_WAKEUP). I recommend
that you always turn it on as it so far always have had a positive
performance impact.
The name and inspiration of the flag has been taken from io_uring by
Jens Axboe. Details about this feature in io_uring can be found in
http://kernel.dk/io_uring.pdf, section 8.3.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This commit replaces ndo_xsk_async_xmit with ndo_xsk_wakeup. This new
ndo provides the same functionality as before but with the addition of
a new flags field that is used to specifiy if Rx, Tx or both should be
woken up. The previous ndo only woke up Tx, as implied by the
name. The i40e and ixgbe drivers (which are all the supported ones)
are updated with this new interface.
This new ndo will be used by the new need_wakeup functionality of XDP
sockets that need to be able to wake up both Rx and Tx driver
processing.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Add generic packet traps and groups that can report dropped packets as
well as exceptions such as TTL error.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the basic packet trap infrastructure that allows device drivers to
register their supported packet traps and trap groups with devlink.
Each driver is expected to provide basic information about each
supported trap, such as name and ID, but also the supported metadata
types that will accompany each packet trapped via the trap. The
currently supported metadata type is just the input port, but more will
be added in the future. For example, output port and traffic class.
Trap groups allow users to set the action of all member traps. In
addition, users can retrieve per-group statistics in case per-trap
statistics are too narrow. In the future, the trap group object can be
extended with more attributes, such as policer settings which will limit
the amount of traffic generated by member traps towards the CPU.
Beside registering their packet traps with devlink, drivers are also
expected to report trapped packets to devlink along with relevant
metadata. devlink will maintain packets and bytes statistics for each
packet trap and will potentially report the trapped packet with its
metadata to user space via drop monitor netlink channel.
The interface towards the drivers is simple and allows devlink to set
the action of the trap. Currently, only two actions are supported:
'trap' and 'drop'. When set to 'trap', the device is expected to provide
the sole copy of the packet to the driver which will pass it to devlink.
When set to 'drop', the device is expected to drop the packet and not
send a copy to the driver. In the future, more actions can be added,
such as 'mirror'.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drop monitor has start and stop commands, but so far these were only
used to start and stop monitoring of software drops.
Now that drop monitor can also monitor hardware drops, we should allow
the user to control these as well.
Do that by adding SW and HW flags to these commands. If no flag is
specified, then only start / stop monitoring software drops. This is
done in order to maintain backward-compatibility with existing user
space applications.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In summary alert mode a notification is sent with a list of recent drop
reasons and a count of how many packets were dropped due to this reason.
To avoid expensive operations in the context in which packets are
dropped, each CPU holds an array whose number of entries is the maximum
number of drop reasons that can be encoded in the netlink notification.
Each entry stores the drop reason and a count. When a packet is dropped
the array is traversed and a new entry is created or the count of an
existing entry is incremented.
Later, in process context, the array is replaced with a newly allocated
copy and the old array is encoded in a netlink notification. To avoid
breaking user space, the notification includes the ancillary header,
which is 'struct net_dm_alert_msg' with number of entries set to '0'.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In a similar fashion to software drops, extend drop monitor to send
netlink events when packets are dropped by the underlying hardware.
The main difference is that instead of encoding the program counter (PC)
from which kfree_skb() was called in the netlink message, we encode the
hardware trap name. The two are mostly equivalent since they should both
help the user understand why the packet was dropped.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The drop monitor configuration (e.g., alert mode) is global, but user
will be able to enable monitoring of only software or hardware drops.
Therefore, ensure that monitoring of both software and hardware drops are
disabled before allowing drop monitor configuration to take place.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Export a function that can be invoked in order to report packets that
were dropped by the underlying hardware along with metadata.
Subsequent patches will add support for the different alert modes.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like software drops, hardware drops also need the same type of per-CPU
data. Therefore, initialize it during module initialization and
de-initialize it during module exit.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently drop monitor only reports software drops to user space, but
subsequent patches are going to add support for hardware drops.
Like software drops, the per-CPU data of hardware drops needs to be
initialized and de-initialized upon module initialization and exit. To
avoid code duplication, break this code into separate functions, so that
these could be re-used for hardware drops.
No functional changes intended.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johan Hedberg says:
====================
pull request: bluetooth 2019-08-17
Here's a set of Bluetooth fixes for the 5.3-rc series:
- Multiple fixes for Qualcomm (btqca & hci_qca) drivers
- Minimum encryption key size debugfs setting (this is required for
Bluetooth Qualification)
- Fix hidp_send_message() to have a meaningful return value
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently this is needed only for user-space compatibility, so similar
object adds/deletes as the dumped ones would succeed. Later it can be
used for L2 mcast MAC add/delete.
v3: fix compiler warning (DaveM)
v2: don't send a notification when used from user-space, arm the group
timer if no ports are left after host entry del
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we dump only the port mdb entries but we can have host-joined
entries on the bridge itself and they should be treated as normal temp
mdbs, they're already notified:
$ bridge monitor all
[MDB]dev br0 port br0 grp ff02::8 temp
The group will not be shown in the bridge mdb output, but it takes 1 slot
and it's timing out. If it's only host-joined then the mdb show output
can even be empty.
After this patch we show the host-joined groups:
$ bridge mdb show
dev br0 port br0 grp ff02::8 temp
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have to factor out the mdb fill portion in order to re-use it later for
the bridge mdb entries. No functional changes intended.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trivial patch to move the vlan comments in their proper places above the
vid 0 checks.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Supported PHY features are either auto-detected or explicitly set.
In both cases calling genphy_config_init isn't needed.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For testing and qualification purposes it is useful to allow changing
the minimum encryption key size value that the host stack is going to
enforce. This adds a new debugfs setting min_encrypt_key_size to achieve
this functionality.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
This commit eliminates the use of the link 'stale_limit' & 'prev_from'
(besides the already removed - 'stale_cnt') variables in the detection
of repeated retransmit failures as there is no proper way to initialize
them to avoid a false detection, i.e. it is not really a retransmission
failure but due to a garbage values in the variables.
Instead, a jiffies variable will be added to individual skbs (like the
way we restrict the skb retransmissions) in order to mark the first skb
retransmit time. Later on, at the next retransmissions, the timestamp
will be checked to see if the skb in the link transmq is "too stale",
that is, the link tolerance time has passed, so that a link reset will
be ordered. Note, just checking on the first skb in the queue is fine
enough since it must be the oldest one.
A counter is also added to keep track the actual skb retransmissions'
number for later checking when the failure happens.
The downside of this approach is that the skb->cb[] buffer is about to
be exhausted, however it is always able to allocate another memory area
and keep a reference to it when needed.
Fixes: 77cf8edbc0 ("tipc: simplify stale link failure criteria")
Reported-by: Hoang Le <hoang.h.le@dektech.com.au>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl1T5QEACgkQ+7dXa6fL
C2utGA//SeeXoU2E7ER2m66REbD/3+0/4Gwcy/NpNgou+GDLS9zo7ASbcQOmW1uN
MkzEPsL4NuYyJFxJits02nDrqmIpsBKyg+tZCR1CfmmkYrvSHZ5UE8AWr3Ogu9nR
3f6bJ2IhDmuSCQclThHJPaoSoPfLpvrqOMzvdngPzcvrIBX9YKQ91Pye8mEaZTbL
yDWJ8MVN0OJV0X/uUVkaGd/5k/FImjBNV6+kH766Hna0lrCUO4NgwBylFHtyi06I
ZG6zqDtnnZhytN5qkRgtYARqGRoAVubjoHUVywcLreenZ6XzXvo1+ch2aZ7ZFr0m
10oSpjsgn5d2LniHhESxTZBVR+USjuTD/ySLyz5aPA0zE9Qo9qoQPURdZ0xtz4lC
j15gjxXGZBw5c0W47IomGmH2tpUQzOVa8fUWVcZqBa2GS6ByxWzyVpOLd81EnUNQ
MyXOZm44EeM6AE39/xRTYbNw8CGCzbV5gyENosTC0oXFIQjHBFXmHOMZgSIDUZiQ
i09JiQGWafkuQfMM1TRaMICm83q4ObrCbuZTzf0c7vELAACCP10h6kF5KEJSUzuI
Mf7VkVPjwgYx083zdUHzW8k1ZJyB6uOg3NFInZz75mxkHj9kyCfTZPd5qgsAqRY7
MLLM/09PJsjNbHcstJLimJ0tAkHIHT6cx66QkYh4HhV3B7C5ACg=
=zu2P
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-fixes-20190814' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Fix local endpoint handling
Here's a pair of patches that fix two issues in the handling of local
endpoints (rxrpc_local structs):
(1) Use list_replace_init() rather than list_replace() if we're going to
unconditionally delete the replaced item later, lest the list get
corrupted.
(2) Don't access the rxrpc_local object after passing our ref to the
workqueue, not even to illuminate tracepoints, as the work function
may cause the object to be freed. We have to cache the information
beforehand.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
This patchset contains Netfilter fixes for net:
1) Extend selftest to cover flowtable with ipsec, from Florian Westphal.
2) Fix interaction of ipsec with flowtable, also from Florian.
3) User-after-free with bound set to rule that fails to load.
4) Adjust state and timeout for flows that expire.
5) Timeout update race with flows in teardown state.
6) Ensure conntrack id hash calculation use invariants as input,
from Dirk Morris.
7) Do not push flows into flowtable for TCP fin/rst packets.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQFHBAABCgAxFiEEmvEkXzgOfc881GuFWsYho5HknSAFAl1TpDwTHG1rbEBwZW5n
dXRyb25peC5kZQAKCRBaxiGjkeSdIMwYB/49sQtE+obVwAOGcXogemkN5+Gj/KTo
KzF0CLYRjVOz2FViGVJL9HQ5QZwy1IxTycIUV2kKnrlI6s0Z17z3G0lV2+J4scuy
O+wVn36QzQHdP3Uj8kGmNQ1v6BliePM8bvm4sHtXzb1diW+qNg/0LmNN4brjPWka
kfvFjS2IdXa7BblT2aiAZj2V+iSQEpy5NieVwcBti0E2AFb+W3qEa8Lg4QORy3Qi
MAHg0QSUHZp0+joXRQiott4WttfuQ17OQFuxzsQ7bQCQhBQWYUmg14Hm/LP4V3g2
9algzS6g3Qapdo7KEJrCSViblzHVD+yL3BemfahNUyK4F8zGkPq48Kp5
=MqG7
-----END PGP SIGNATURE-----
Merge tag 'linux-can-next-for-5.4-20190814' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
pull-request: can-next 2019-08-14
this is a pull request for net-next/master consisting of 41 patches.
The first two patches are for the kvaser_pciefd driver: Christer Beskow
removes unnecessary code in the kvaser_pciefd_pwm_stop() function,
YueHaibing removes the unused including of <linux/version.h>.
In the next patch YueHaibing also removes the unused including of
<linux/version.h> in the f81601 driver.
In the ti_hecc driver the next 6 patches are by me and fix checkpatch
warnings. YueHaibing's patch removes an unused variable in the
ti_hecc_mailbox_read() function.
The next 6 patches all target the xilinx_can driver. Anssi Hannula's
patch fixes a chip start failure with an invalid bus. The patch by
Venkatesh Yadav Abbarapu skips an error message in case of a deferred
probe. The 3 patches by Appana Durga Kedareswara rao fix the RX and TX
path for CAN-FD frames. Srinivas Neeli's patch fixes the bit timing
calculations for CAN-FD.
The next 12 patches are by me and several checkpatch warnings in the
af_can, raw and bcm components.
Thomas Gleixner provides a patch for the bcm, which switches the timer
to HRTIMER_MODE_SOFT and removes the hrtimer_tasklet.
Then 6 more patches by me for the gw component, which fix checkpatch
warnings, followed by 2 patches by Oliver Hartkopp to add CAN-FD
support.
The vcan driver gets 3 patches by me, fixing checkpatch warnings.
And finally a patch by Andre Hartmann to fix typos in CAN's netlink
header.
====================
The ctx->sk_write_space pointer is only set when TLS tx mode is enabled.
When running without TX mode its a null pointer but we still set the
sk sk_write_space pointer on close().
Fix the close path to only overwrite sk->sk_write_space when the current
pointer is to the tls_write_space function indicating the tls module should
clean it up properly as well.
Reported-by: Hillf Danton <hdanton@sina.com>
Cc: Ying Xue <ying.xue@windriver.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Fixes: 57c722e932 ("net/tls: swap sk_write_space on close")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__page_pool_get_cached() will return NULL when the ring is
empty, even if there are pages present in the lookaside cache.
It is also possible to refill the cache, and then return a
NULL page.
Restructure the logic so eliminate both cases.
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Original commit from 2011 updated to include a change by
Yuval Shaia <yuval.shaia@oracle.com>
that adds a new statistic counter "send_stuck_rm"
to capture the messages looping exessively
in the send path.
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In a previous commit, fields were added to "struct rds_statistics"
but array "rds_stat_names" was not updated accordingly.
Please note the inconsistent naming of the string representations
that is done in the name of compatibility
with the Oracle internal code-base.
s_recv_bytes_added_to_socket -> "recv_bytes_added_to_sock"
s_recv_bytes_removed_from_socket -> "recv_bytes_freed_fromsock"
Fixes: 192a798f52 ("RDS: add stat for socket recv memory usage")
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com>
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This will kick the RDS worker thread if we have been looping
too long.
Original commit from 2012 updated to include a change by
Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
that triggers "must_wake" if "rds_ib_recv_refill_one" fails.
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove variable initializations in functions that
are followed by assignments before use
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support of the socket options RDS6_INFO_SOCKETS and
RDS6_INFO_RECV_MESSAGES which update the RDS_INFO_SOCKETS and
RDS_INFO_RECV_MESSAGES options respectively. The old options work
for IPv4 sockets only.
Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
clang warns:
net/netfilter/nft_bitwise.c:138:50: error: size argument in 'memcmp'
call is a comparison [-Werror,-Wmemsize-comparison]
if (memcmp(&priv->xor, &zero, sizeof(priv->xor) ||
~~~~~~~~~~~~~~~~~~^~
net/netfilter/nft_bitwise.c:138:6: note: did you mean to compare the
result of 'memcmp' instead?
if (memcmp(&priv->xor, &zero, sizeof(priv->xor) ||
^
)
net/netfilter/nft_bitwise.c:138:32: note: explicitly cast the argument
to size_t to silence this warning
if (memcmp(&priv->xor, &zero, sizeof(priv->xor) ||
^
(size_t)(
1 error generated.
Adjust the parentheses so that the result of the sizeof is used for the
size argument in memcmp, rather than the result of the comparison (which
would always be true because sizeof is a non-zero number).
Fixes: bd8699e9e2 ("netfilter: nft_bitwise: add offload support")
Link: https://github.com/ClangBuiltLinux/linux/issues/638
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
rxrpc_queue_local() attempts to queue the local endpoint it is given and
then, if successful, prints a trace line. The trace line includes the
current usage count - but we're not allowed to look at the local endpoint
at this point as we passed our ref on it to the workqueue.
Fix this by reading the usage count before queuing the work item.
Also fix the reading of local->debug_id for trace lines, which must be done
with the same consideration as reading the usage count.
Fixes: 09d2bf595d ("rxrpc: Add a tracepoint to track rxrpc_local refcounting")
Reported-by: syzbot+78e71c5bab4f76a6a719@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
When a local endpoint (struct rxrpc_local) ceases to be in use by any
AF_RXRPC sockets, it starts the process of being destroyed, but this
doesn't cause it to be removed from the namespace endpoint list immediately
as tearing it down isn't trivial and can't be done in softirq context, so
it gets deferred.
If a new socket comes along that wants to bind to the same endpoint, a new
rxrpc_local object will be allocated and rxrpc_lookup_local() will use
list_replace() to substitute the new one for the old.
Then, when the dying object gets to rxrpc_local_destroyer(), it is removed
unconditionally from whatever list it is on by calling list_del_init().
However, list_replace() doesn't reset the pointers in the replaced
list_head and so the list_del_init() will likely corrupt the local
endpoints list.
Fix this by using list_replace_init() instead.
Fixes: 730c5fd42c ("rxrpc: Fix local endpoint refcounting")
Reported-by: syzbot+193e29e9387ea5837f1d@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
TCP rst and fin packets do not qualify to place a flow into the
flowtable. Most likely there will be no more packets after connection
closure. Without this patch, this flow entry expires and connection
tracking picks up the entry in ESTABLISHED state using the fixup
timeout, which makes this look inconsistent to the user for a connection
that is actually already closed.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If the stream outq is not empty, need to kfree nstr_list.
Fixes: d570a59c5b ("sctp: only allow the out stream reset when the stream outq is empty")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
As the annotation says in sctp_do_8_2_transport_strike():
"If the transport error count is greater than the pf_retrans
threshold, and less than pathmaxrtx ..."
It should be transport->error_count checked with pathmaxrxt,
instead of asoc->pf_retrans.
Fixes: 5aa93bcf66 ("sctp: Implement quick failover draft from tsvwg")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for net-next:
1) Rename mss field to mss_option field in synproxy, from Fernando Mancera.
2) Use SYSCTL_{ZERO,ONE} definitions in conntrack, from Matteo Croce.
3) More strict validation of IPVS sysctl values, from Junwei Hu.
4) Remove unnecessary spaces after on the right hand side of assignments,
from yangxingwu.
5) Add offload support for bitwise operation.
6) Extend the nft_offload_reg structure to store immediate date.
7) Collapse several ip_set header files into ip_set.h, from
Jeremy Sowden.
8) Make netfilter headers compile with CONFIG_KERNEL_HEADER_TEST=y,
from Jeremy Sowden.
9) Fix several sparse warnings due to missing prototypes, from
Valdis Kletnieks.
10) Use static lock initialiser to ensure connlabel spinlock is
initialized on boot time to fix sched/act_ct.c, patch
from Florian Westphal.
====================
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
It is enough for caller of devlink_compat_switch_id_get() to hold the net
device to guarantee that devlink port is not destroyed concurrently. Remove
rtnl lock assertion and modify comment to warn user that they must hold
either rtnl lock or reference to net device. This is necessary to
accommodate future implementation of rtnl-unlocked TC offloads driver
callbacks.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Daniel Borkmann says:
====================
The following pull-request contains BPF updates for your *net-next* tree.
There is a small merge conflict in libbpf (Cc Andrii so he's in the loop
as well):
for (i = 1; i <= btf__get_nr_types(btf); i++) {
t = (struct btf_type *)btf__type_by_id(btf, i);
if (!has_datasec && btf_is_var(t)) {
/* replace VAR with INT */
t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0);
<<<<<<< HEAD
/*
* using size = 1 is the safest choice, 4 will be too
* big and cause kernel BTF validation failure if
* original variable took less than 4 bytes
*/
t->size = 1;
*(int *)(t+1) = BTF_INT_ENC(0, 0, 8);
} else if (!has_datasec && kind == BTF_KIND_DATASEC) {
=======
t->size = sizeof(int);
*(int *)(t + 1) = BTF_INT_ENC(0, 0, 32);
} else if (!has_datasec && btf_is_datasec(t)) {
>>>>>>> 72ef80b5ee
/* replace DATASEC with STRUCT */
Conflict is between the two commits 1d4126c4e1 ("libbpf: sanitize VAR to
conservative 1-byte INT") and b03bc6853c ("libbpf: convert libbpf code to
use new btf helpers"), so we need to pick the sanitation fixup as well as
use the new btf_is_datasec() helper and the whitespace cleanup. Looks like
the following:
[...]
if (!has_datasec && btf_is_var(t)) {
/* replace VAR with INT */
t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0);
/*
* using size = 1 is the safest choice, 4 will be too
* big and cause kernel BTF validation failure if
* original variable took less than 4 bytes
*/
t->size = 1;
*(int *)(t + 1) = BTF_INT_ENC(0, 0, 8);
} else if (!has_datasec && btf_is_datasec(t)) {
/* replace DATASEC with STRUCT */
[...]
The main changes are:
1) Addition of core parts of compile once - run everywhere (co-re) effort,
that is, relocation of fields offsets in libbpf as well as exposure of
kernel's own BTF via sysfs and loading through libbpf, from Andrii.
More info on co-re: http://vger.kernel.org/bpfconf2019.html#session-2
and http://vger.kernel.org/lpc-bpf2018.html#session-2
2) Enable passing input flags to the BPF flow dissector to customize parsing
and allowing it to stop early similar to the C based one, from Stanislav.
3) Add a BPF helper function that allows generating SYN cookies from XDP and
tc BPF, from Petar.
4) Add devmap hash-based map type for more flexibility in device lookup for
redirects, from Toke.
5) Improvements to XDP forwarding sample code now utilizing recently enabled
devmap lookups, from Jesper.
6) Add support for reporting the effective cgroup progs in bpftool, from Jakub
and Takshak.
7) Fix reading kernel config from bpftool via /proc/config.gz, from Peter.
8) Fix AF_XDP umem pages mapping for 32 bit architectures, from Ivan.
9) Follow-up to add two more BPF loop tests for the selftest suite, from Alexei.
10) Add perf event output helper also for other skb-based program types, from Allan.
11) Fix a co-re related compilation error in selftests, from Yonghong.
====================
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Variable rc is initialized to a value that is never read and it is
re-assigned later. The initialization is redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Currently the notifications for deleted snapshots are sent only in case
user deletes a snapshot manually. Send the notifications in case region
is destroyed too.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Change ct id hash calculation to only use invariants.
Currently the ct id hash calculation is based on some fields that can
change in the lifetime on a conntrack entry in some corner cases. The
current hash uses the whole tuple which contains an hlist pointer which
will change when the conntrack is placed on the dying list resulting in
a ct id change.
This patch also removes the reply-side tuple and extension pointer from
the hash calculation so that the ct id will will not change from
initialization until confirmation.
Fixes: 3c79107631 ("netfilter: ctnetlink: don't use conntrack/expect object addresses as id")
Signed-off-by: Dirk Morris <dmorris@metaloft.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Introduce CAN FD support which needs an extension of the netlink API to
pass CAN FD type content to the kernel which has a different size to
Classic CAN. Additionally the struct canfd_frame has a new 'flags' element
that can now be modified with can-gw.
The new CGW_FLAGS_CAN_FD option flag defines whether the routing job
handles Classic CAN or CAN FD frames. This setting is very strict at
reception time and enables the new possibilities, e.g. CGW_FDMOD_* and
modifying the flags element of struct canfd_frame, only when
CGW_FLAGS_CAN_FD is set.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
To prepare the CAN FD support this patch implements the first adaptions in
data structures for CAN FD without changing the current functionality.
Additionally some code at the end of this patch is moved or indented to
simplify the review of the next implementation step.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch rewraps the arguments of cgw_put_job() to avoid long lines,
which also fixes the indention.
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch removes unnecessary blank lines, and adds suggested ones, so
that checkpatch doesn't complain anymore.
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch switches the timer to HRTIMER_MODE_SOFT, which executed the
timer callback in softirq context and removes the hrtimer_tasklet.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch marks the bcm_sock_no_ioctlcmd() function as static as it's
only used in this source file.
Fixes: 473d924d7d ("can: fix ioctl function removal")
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch marks the raw_sock_no_ioctlcmd() function as static as it's
only used in this source file.
Fixes: 473d924d7d ("can: fix ioctl function removal")
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch balances the braces around else statements, so that
checkpatch doesn't complain anymore.
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch removes unnecessary blank lines, and adds suggested ones, so
that checkpatch doesn't complain anymore.
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch switches can_pernet_init() to the preferred style of using
the sizeof() operator in kzalloc().
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch joins all error message strings in af_can to be in single
lines, to ease searching for them.
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch fixes the alignment of find_dev_rcv_lists() and canfd_rcv()
so that checkpatch doesn't complain anymore.
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch balances the braces around else statements, so that
checkpatch doesn't complain anymore.
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
seen during boot:
BUG: spinlock bad magic on CPU#2, swapper/0/1
lock: nf_connlabels_lock+0x0/0x60, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
Call Trace:
do_raw_spin_lock+0x14e/0x1b0
nf_connlabels_get+0x15/0x40
ct_init_net+0xc4/0x270
ops_init+0x56/0x1c0
register_pernet_operations+0x1c8/0x350
register_pernet_subsys+0x1f/0x40
tcf_register_action+0x7c/0x1a0
do_one_initcall+0x13d/0x2d9
Problem is that ct action init function can run before
connlabels_init(). Lock has not been initialised yet.
Fix it by using a static initialiser.
Fixes: b57dc7c13e ("net/sched: Introduce action ct")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Sparse warns about two tables not being declared.
CHECK net/netfilter/nf_nat_proto.c
net/netfilter/nf_nat_proto.c:725:26: warning: symbol 'nf_nat_ipv4_ops' was not declared. Should it be static?
net/netfilter/nf_nat_proto.c:964:26: warning: symbol 'nf_nat_ipv6_ops' was not declared. Should it be static?
And in fact they can indeed be static.
Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Sparse rightly complains about undeclared symbols.
CHECK net/netfilter/nft_set_hash.c
net/netfilter/nft_set_hash.c:647:21: warning: symbol 'nft_set_rhash_type' was not declared. Should it be static?
net/netfilter/nft_set_hash.c:670:21: warning: symbol 'nft_set_hash_type' was not declared. Should it be static?
net/netfilter/nft_set_hash.c:690:21: warning: symbol 'nft_set_hash_fast_type' was not declared. Should it be static?
CHECK net/netfilter/nft_set_bitmap.c
net/netfilter/nft_set_bitmap.c:296:21: warning: symbol 'nft_set_bitmap_type' was not declared. Should it be static?
CHECK net/netfilter/nft_set_rbtree.c
net/netfilter/nft_set_rbtree.c:470:21: warning: symbol 'nft_set_rbtree_type' was not declared. Should it be static?
Include nf_tables_core.h rather than nf_tables.h to pick up the additional definitions.
Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
linux/netfilter/ipset/ip_set.h included four other header files:
include/linux/netfilter/ipset/ip_set_comment.h
include/linux/netfilter/ipset/ip_set_counter.h
include/linux/netfilter/ipset/ip_set_skbinfo.h
include/linux/netfilter/ipset/ip_set_timeout.h
Of these the first three were not included anywhere else. The last,
ip_set_timeout.h, was included in a couple of other places, but defined
inline functions which call other inline functions defined in ip_set.h,
so ip_set.h had to be included before it.
Inlined all four into ip_set.h, and updated the other files that
included ip_set_timeout.h.
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Store immediate data into offload context register. This allows follow
up instructions to take it from the corresponding source register.
This patch is required to support for payload mangling, although other
instructions that take data from source register will benefit from this
too.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Extract mask from bitwise operation and store it into the corresponding
context register so the cmp instruction can set the mask accordingly.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Let hidp_send_message return the number of successfully queued bytes
instead of an unconditional 0.
With the return value fixed to 0, other drivers relying on hidp, such as
hidraw, can not return meaningful values from their respective
implementations of write(). In particular, with the current behavior, a
hidraw device's write() will have different return values depending on
whether the device is connected via USB or Bluetooth, which makes it
harder to abstract away the transport layer.
Signed-off-by: Fabian Henneke <fabian.henneke@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
We set the field 'addr_trial_end' to 'jiffies', instead of the current
value 0, at the moment the node address is initialized. This guarantees
we don't inadvertently enter an address trial period when the node
address is explicitly set by the user.
Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix rxrpc_unuse_local() to handle a NULL local pointer as it can be called
on an unbound socket on which rx->local is not yet set.
The following reproduced (includes omitted):
int main(void)
{
socket(AF_RXRPC, SOCK_DGRAM, AF_INET);
return 0;
}
causes the following oops to occur:
BUG: kernel NULL pointer dereference, address: 0000000000000010
...
RIP: 0010:rxrpc_unuse_local+0x8/0x1b
...
Call Trace:
rxrpc_release+0x2b5/0x338
__sock_release+0x37/0xa1
sock_close+0x14/0x17
__fput+0x115/0x1e9
task_work_run+0x72/0x98
do_exit+0x51b/0xa7a
? __context_tracking_exit+0x4e/0x10e
do_group_exit+0xab/0xab
__x64_sys_exit_group+0x14/0x17
do_syscall_64+0x89/0x1d4
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Reported-by: syzbot+20dee719a2e090427b5f@syzkaller.appspotmail.com
Fixes: 730c5fd42c ("rxrpc: Fix local endpoint refcounting")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeffrey Altman <jaltman@auristor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previous patch made the length of the per-CPU skb drop list
configurable. Expose a counter that shows how many packets could not be
enqueued to this list.
This allows users determine the desired queue length.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In packet alert mode, each CPU holds a list of dropped skbs that need to
be processed in process context and sent to user space. To avoid
exhausting the system's memory the maximum length of this queue is
currently set to 1000.
Allow users to tune the length of this queue according to their needs.
The configured length is reported to user space when drop monitor
configuration is queried.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Users should be able to query the current configuration of drop monitor
before they start using it. Add a command to query the existing
configuration which currently consists of alert mode and packet
truncation length.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When sending dropped packets to user space it is not always necessary to
copy the entire packet as usually only the headers are of interest.
Allow user to specify the truncation length and add the original length
of the packet as additional metadata to the netlink message.
By default no truncation is performed.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So far drop monitor supported only one alert mode in which a summary of
locations in which packets were recently dropped was sent to user space.
This alert mode is sufficient in order to understand that packets were
dropped, but lacks information to perform a more detailed analysis.
Add a new alert mode in which the dropped packet itself is passed to
user space along with metadata: The drop location (as program counter
and resolved symbol), ingress netdevice and drop timestamp. More
metadata can be added in the future.
To avoid performing expensive operations in the context in which
kfree_skb() is invoked (can be hard IRQ), the dropped skb is cloned and
queued on per-CPU skb drop list. Then, in process context the netlink
message is allocated, prepared and finally sent to user space.
The per-CPU skb drop list is limited to 1000 skbs to prevent exhausting
the system's memory. Subsequent patches will make this limit
configurable and also add a counter that indicates how many skbs were
tail dropped.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The next patch is going to add another alert mode in which the dropped
packet is notified to user space, instead of only a summary of recent
drops.
Abstract the differences between the modes by adding alert mode
operations. The operations are selected based on the currently
configured mode and associated with the probes and the work item just
before tracing starts.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the configure command does not do anything but return an
error. Subsequent patches will enable the command to change various
configuration options such as alert mode and packet truncation.
Similar to other netlink-based configuration channels, make sure only
users with the CAP_NET_ADMIN capability set can execute this command.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function reset_per_cpu_data() allocates and prepares a new skb for
the summary netlink alert message ('NET_DM_CMD_ALERT'). The new skb is
stored in the per-CPU 'data' variable and the old is returned.
The function is invoked during module initialization and from the
workqueue, before an alert is sent. This means that it is possible to
receive an alert with stale data, if we stopped tracing when the
hysteresis timer ('data->send_timer') was pending.
Instead of invoking the function during module initialization, invoke it
just before we start tracing and ensure we get a fresh skb.
This also allows us to remove the calls to initialize the timer and the
work item from the module initialization path, since both could have
been triggered by the error paths of reset_per_cpu_data().
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The timer and work item are currently initialized once during module
init, but subsequent patches will need to associate different functions
with the work item, based on the configured alert mode.
Allow subsequent patches to make that change by initializing and
de-initializing these objects during tracing enable and disable.
This also guarantees that once the request to disable tracing returns,
no more netlink notifications will be generated.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Subsequent patches will need to enable / disable tracing based on the
configured alerting mode.
Reduce the nesting level and prepare for the introduction of this
functionality by splitting the tracing enable / disable operations into
two different functions.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.
This cleans up a lot of unneeded code and logic around the debugfs wimax
files, making all of this much simpler and easier to understand.
Cc: Inaky Perez-Gonzalez <inaky.perez-gonzalez@intel.com>
Cc: linux-wimax@intel.com
Cc: netdev@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we swap the original proto and clear the ULP pointer
on close we have to make sure no callback will try to access
the freed state. sk_write_space is not part of sk_prot, remember
to swap it.
Reported-by: syzbot+dcdc9deefaec44785f32@syzkaller.appspotmail.com
Fixes: 95fa145479 ("bpf: sockmap/tls, close can race with map free")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net/sched/sch_taprio.c:680:32: warning:
entry_list_policy defined but not used [-Wunused-const-variable=]
One of the points of commit a3d43c0d56 ("taprio: Add support adding
an admin schedule") is that it removes support (it now returns "not
supported") for schedules using the TCA_TAPRIO_ATTR_SCHED_SINGLE_ENTRY
attribute (which were never used), the parsing of those types of schedules
was the only user of this policy. So removing this policy should be fine.
Reported-by: Hulk Robot <hulkci@huawei.com>
Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Generating and retrieving socket cookies are a useful feature that is
exposed to BPF for various program types through bpf_get_socket_cookie()
helper.
The fact that the cookie counter is per netns is quite a limitation
for BPF in practice in particular for programs in host namespace that
use socket cookies as part of a map lookup key since they will be
causing socket cookie collisions e.g. when attached to BPF cgroup hooks
or cls_bpf on tc egress in host namespace handling container traffic
from veth or ipvlan devices with peer in different netns. Change the
counter to be global instead.
Socket cookie consumers must assume the value as opqaue in any case.
Not every socket must have a cookie generated and knowledge of the
counter value itself does not provide much value either way hence
conversion to global is fine.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Martynas Pumputis <m@lambda.lt>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current implementation of TCP MTU probing can considerably
underestimate the MTU on lossy connections allowing the MSS to get down to
48. We have found that in almost all of these cases on our networks these
paths can handle much larger MTUs meaning the connections are being
artificially limited. Even though TCP MTU probing can raise the MSS back up
we have seen this not to be the case causing connections to be "stuck" with
an MSS of 48 when heavy loss is present.
Prior to pushing out this change we could not keep TCP MTU probing enabled
b/c of the above reasons. Now with a reasonble floor set we've had it
enabled for the past 6 months.
The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
administrators the ability to control the floor of MSS probing.
Signed-off-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The size of the snapshot has to be the same as the size of the region,
therefore no need to pass it again during snapshot creation. Remove the
arg and use region->size instead.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Starting from commit d41a69f1d3 ("tcp: make tcp_sendmsg() aware of socket backlog")
loopback flows got hurt, because for each skb sent, the socket receives an
immediate ACK and sk_flush_backlog() causes extra work.
Intent was to not let the backlog grow too much, but we went a bit too far.
We can check the backlog every 16 skbs (about 1MB chunks)
to increase TCP over loopback performance by about 15 %
Note that the call to sk_flush_backlog() handles a single ACK,
thanks to coalescing done on backlog, but cleans the 16 skbs
found in rtx rb-tree.
Reported-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>