On hosts with a lot of cores, RPC workloads suffer from heavy contention on slab spinlocks.
20.69% [kernel] [k] queued_spin_lock_slowpath
5.64% [kernel] [k] _raw_spin_lock
3.83% [kernel] [k] syscall_return_via_sysret
3.48% [kernel] [k] __entry_text_start
1.76% [kernel] [k] __netif_receive_skb_core
1.64% [kernel] [k] __fget
For each sendmsg(), we allocate one skb, and free it at the time ACK packet comes.
In many cases, ACK packets are handled by another cpus, and this unfortunately
incurs heavy costs for slab layer.
This patch uses an extra pointer in socket structure, so that we try to reuse
the same skb and avoid these expensive costs.
We cache at most one skb per socket so this should be safe as far as
memory pressure is concerned.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We prefer static_branch_unlikely() over static_key_false() these days.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni says:
====================
net: dev: BYPASS for lockless qdisc
This patch series is aimed at improving xmit performances of lockless qdisc
in the uncontended scenario.
After the lockless refactor pfifo_fast can't leverage the BYPASS optimization.
Due to retpolines the overhead for the avoidables enqueue and dequeue operations
has increased and we see measurable regressions.
The first patch introduces the BYPASS code path for lockless qdisc, and the
second one optimizes such path further. Overall this avoids up to 3 indirect
calls per xmit packet. Detailed performance figures are reported in the 2nd
patch.
v2 -> v3:
- qdisc_is_empty() has a const argument (Eric)
v1 -> v2:
- use really an 'empty' flag instead of 'not_empty', as
suggested by Eric
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
With commit c5ad119fb6 ("net: sched: pfifo_fast use skb_array")
pfifo_fast no longer benefit from the TCQ_F_CAN_BYPASS optimization.
Due to retpolines the cost of the enqueue()/dequeue() pair has become
relevant and we observe measurable regression for the uncontended
scenario when the packet-rate is below line rate.
After commit 46b1c18f9d ("net: sched: put back q.qlen into a
single location") we can check for empty qdisc with a reasonably
fast operation even for nolock qdiscs.
This change extends TCQ_F_CAN_BYPASS support to nolock qdisc.
The new chunk of code mirrors closely the existing one for traditional
qdisc, leveraging a newly introduced helper to read atomically the
qdisc length.
Tested with pktgen in queue xmit mode, with pfifo_fast, a MQ
device, and MQ root qdisc:
threads vanilla patched
kpps kpps
1 2465 2889
2 4304 5188
4 7898 9589
Same as above, but with a single queue device:
threads vanilla patched
kpps kpps
1 2556 2827
2 2900 2900
4 5000 5000
8 4700 4700
No mesaurable changes in the contended scenarios, and more 10%
improvement in the uncontended ones.
v1 -> v2:
- rebased after flag name change
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The queue is marked not empty after acquiring the seqlock,
and it's up to the NOLOCK qdisc clearing such flag on dequeue.
Since the empty status lays on the same cache-line of the
seqlock, it's always hot on cache during the updates.
This makes the empty flag update a little bit loosy. Given
the lack of synchronization between enqueue and dequeue, this
is unavoidable.
v2 -> v3:
- qdisc_is_empty() has a const argument (Eric)
v1 -> v2:
- use really an 'empty' flag instead of 'not_empty', as
suggested by Eric
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add documentation to the tcp_ca_state enum, since this enum is
exposed in uapi.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Sowmini Varadhan <sowmini05@gmail.com>
Acked-by: Sowmini Varadhan <sowmini05@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_clock_ns() (aka ktime_get_ns()) is using monotonic clock,
so the checks we had in tcp_mstamp_refresh() are no longer
relevant.
This patch removes cpu stall (when the cache line is not hot)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The tristate prompt should have been replaced rather than defined a few
lines below, rebase mistake.
Fixes: 17cc982176 ("net: phy: Move Omega PHY entry to Cygnus PHY driver")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since maxattr is common, the policy can't really differ sanely,
so make it common as well.
The only user that did in fact manage to make a non-common policy
is taskstats, which has to be really careful about it (since it's
still using a common maxattr!). This is no longer supported, but
we can fake it using pre_doit.
This reduces the size of e.g. nl80211.o (which has lots of commands):
text data bss dec hex filename
398745 14323 2240 415308 6564c net/wireless/nl80211.o (before)
397913 14331 2240 414484 65314 net/wireless/nl80211.o (after)
--------------------------------
-832 +8 0 -824
Which is obviously just 8 bytes for each command, and an added 8
bytes for the new policy pointer. I'm not sure why the ops list is
counted as .text though.
Most of the code transformations were done using the following spatch:
@ops@
identifier OPS;
expression POLICY;
@@
struct genl_ops OPS[] = {
...,
{
- .policy = POLICY,
},
...
};
@@
identifier ops.OPS;
expression ops.POLICY;
identifier fam;
expression M;
@@
struct genl_family fam = {
.ops = OPS,
.maxattr = M,
+ .policy = POLICY,
...
};
This also gets rid of devlink_nl_cmd_region_read_dumpit() accessing
the cb->data as ops, which we want to change in a later genl patch.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace the call to netif_wake_queue in rtl8169_start_xmit with
netif_start_queue as we don't need to actually wake up the queue since
we are still in mid transmit so we just need to reset the bit so it
doesn't prevent the next transmit.
(Description shamelessly copied from a mail sent by Alex.)
Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Aquantia PHY's of the AQR107 family support the downshift feature.
Add support for it as standard PHY tunable so that it can be controlled
via ethtool.
The AQCS109 supports a proprietary 2-pair 1Gbps mode. If two such PHY's
are connected to each other with a 2-pair cable, they may not be able
to establish a link if both advertise modes > 1Gbps.
v2:
- add downshift event detection
- warn if downshift occurred
- read downshifted rate from vendor register
- enable downshift per default on all AQR107 family members
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vlad Buslov says:
====================
Refactor flower classifier to remove dependency on rtnl lock
Currently, all netlink protocol handlers for updating rules, actions and
qdiscs are protected with single global rtnl lock which removes any
possibility for parallelism. This patch set is a third step to remove
rtnl lock dependency from TC rules update path.
Recently, new rtnl registration flag RTNL_FLAG_DOIT_UNLOCKED was added.
TC rule update handlers (RTM_NEWTFILTER, RTM_DELTFILTER, etc.) are
already registered with this flag and only take rtnl lock when qdisc or
classifier requires it. Classifiers can indicate that their ops
callbacks don't require caller to hold rtnl lock by setting the
TCF_PROTO_OPS_DOIT_UNLOCKED flag. The goal of this change is to refactor
flower classifier to support unlocked execution and register it with
unlocked flag.
This patch set implements following changes to make flower classifier
concurrency-safe:
- Implement reference counting for individual filters. Change fl_get to
take reference to filter. Implement tp->ops->put callback that was
introduced in cls API patch set to release reference to flower filter.
- Use tp->lock spinlock to protect internal classifier data structures
from concurrent modification.
- Handle concurrent tcf proto deletion by returning EAGAIN, which will
cause cls API to retry and create new proto instance or return error
to the user (depending on message type).
- Handle concurrent insertion of filter with same priority and handle by
returning EAGAIN, which will cause cls API to lookup filter again and
process it accordingly to netlink message flags.
- Extend flower mask with reference counting and protect masks list with
masks_lock spinlock.
- Prevent concurrent mask insertion by inserting temporary value to
masks hash table. This is necessary because mask initialization is a
sleeping operation and cannot be done while holding tp->lock.
Both chain level and classifier level conflicts are resolved by
returning -EAGAIN to cls API that results restart of whole operation.
This retry mechanism is a result of fine-grained locking approach used
in this and previous changes in series and is necessary to allow
concurrent updates on same chain instance. Alternative approach would be
to lock the whole chain while updating filters on any of child tp's,
adding and removing classifier instances from the chain. However, since
most CPU-intensive parts of filter update code are specifically in
classifier code and its dependencies (extensions and hw offloads), such
approach would negate most of the gains introduced by this change and
previous changes in the series when updating same chain instance.
Tcf hw offloads API is not changed by this patch set and still requires
caller to hold rtnl lock. Refactored flower classifier tracks rtnl lock
state by means of 'rtnl_held' flag provided by cls API and obtains the
lock before calling hw offloads. Following patch set will lift this
restriction and refactor cls hw offloads API to support unlocked
execution.
With these changes flower classifier is safely registered with
TCF_PROTO_OPS_DOIT_UNLOCKED flag in last patch.
Changes from V2 to V3:
- Rebase on latest net-next
Changes from V1 to V2:
- Extend cover letter with explanation about retry mechanism.
- Rebase on current net-next.
- Patch 1:
- Use rcu_dereference_raw() for tp->root dereference.
- Update comment in fl_head_dereference().
- Patch 2:
- Remove redundant check in fl_change error handling code.
- Add empty line between error check and new handle assignment.
- Patch 3:
- Refactor loop in fl_get_next_filter() to improve readability.
- Patch 4:
- Refactor __fl_delete() to improve readability.
- Patch 6:
- Fix comment in fl_check_assign_mask().
- Patch 9:
- Extend commit message.
- Fix error code in comment.
- Patch 11:
- Fix fl_hw_replace_filter() to always release rtnl lock in error
handlers.
- Patch 12:
- Don't take rtnl lock before calling __fl_destroy_filter() in
workqueue context.
- Extend commit message with explanation why flower still takes rtnl
lock before calling hardware offloads API.
Github: <https://github.com/vbuslov/linux/tree/unlocked-flower-cong3>
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Set TCF_PROTO_OPS_DOIT_UNLOCKED for flower classifier to indicate that its
ops callbacks don't require caller to hold rtnl lock. Don't take rtnl lock
in fl_destroy_filter_work() that is executed on workqueue instead of being
called by cls API and is not affected by setting
TCF_PROTO_OPS_DOIT_UNLOCKED. Rtnl mutex is still manually taken by flower
classifier before calling hardware offloads API that has not been updated
for unlocked execution.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use 'rtnl_held' flag to track if caller holds rtnl lock. Propagate the flag
to internal functions that need to know rtnl lock state. Take rtnl lock
before calling tcf APIs that require it (hw offload, bind filter, etc.).
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct tcf_proto was extended with spinlock to be used by classifiers
instead of global rtnl lock. Use it to protect shared flower classifier
data structures (handle_idr, mask hashtable and list) and fields of
individual filters that can be accessed concurrently. This patch set uses
tcf_proto->lock as per instance lock that protects all filters on
tcf_proto.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without rtnl lock protection tcf proto can be deleted concurrently. Check
tcf proto 'deleting' flag after taking tcf spinlock to verify that no
concurrent deletion is in progress. Return EAGAIN error if concurrent
deletion detected, which will cause caller to retry and possibly create new
instance of tcf proto.
Retry mechanism is a result of fine-grained locking approach used in this
and previous changes in series and is necessary to allow concurrent updates
on same chain instance. Alternative approach would be to lock the whole
chain while updating filters on any of child tp's, adding and removing
classifier instances from the chain. However, since most CPU-intensive
parts of filter update code are specifically in classifier code and its
dependencies (extensions and hw offloads), such approach would negate most
of the gains introduced by this change and previous changes in the series
when updating same chain instance.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Check if user specified a handle and another filter with the same handle
was inserted concurrently. Return EAGAIN to retry filter processing (in
case it is an overwrite request).
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Protect modifications of flower masks list with spinlock to remove
dependency on rtnl lock and allow concurrent access.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Without rtnl lock protection masks with same key can be inserted
concurrently. Insert temporary mask with reference count zero to masks
hashtable. This will cause any concurrent modifications to retry.
Wait for rcu grace period to complete after removing temporary mask from
masks hashtable to accommodate concurrent readers.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Suggested-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend fl_flow_mask structure with reference counter to allow parallel
modification without relying on rtnl lock. Use rcu read lock to safely
lookup mask and increment reference counter in order to accommodate
concurrent deletes.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to prevent double deletion of filter by concurrent tasks when rtnl
lock is not used for synchronization, add 'deleted' filter field. Check
value of this field when modifying filters and return error if concurrent
deletion is detected.
Refactor __fl_delete() to accept pointer to 'last' boolean as argument,
and return error code as function return value instead. This is necessary
to signal concurrent filter delete to caller.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend flower filters with reference counting in order to remove dependency
on rtnl lock in flower ops and allow to modify filters concurrently.
Reference to flower filter can be taken/released concurrently as soon as it
is marked as 'unlocked' by last patch in this series. Use atomic reference
counter type to make concurrent modifications safe.
Always take reference to flower filter while working with it:
- Modify fl_get() to take reference to filter.
- Implement tp->put() callback as fl_put() function to allow cls API to
release reference taken by fl_get().
- Modify fl_change() to assume that caller holds reference to fold and take
reference to fnew.
- Take reference to filter while using it in fl_walk().
Implement helper functions to get/put filter reference counter.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As a preparation for using classifier spinlock instead of relying on
external rtnl lock, rearrange code in fl_change. The goal is to group the
code which changes classifier state in single block in order to allow
following commits in this set to protect it from parallel modification with
tp->lock. Data structures that require tp->lock protection are mask
hashtable and filters list, and classifier handle_idr.
fl_hw_replace_filter() is a sleeping function and cannot be called while
holding a spinlock. In order to execute all sequence of changes to shared
classifier data structures atomically, call fl_hw_replace_filter() before
modifying them.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Flower classifier only changes root pointer during init and destroy. Cls
API implements reference counting for tcf_proto, so there is no danger of
concurrent access to tp when it is being destroyed, even without protection
provided by rtnl lock.
Implement new function fl_head_dereference() to dereference tp->root
without checking for rtnl lock. Use it in all flower function that obtain
head pointer instead of rtnl_dereference().
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NFP driver ABI contains bits for L2 switching which were never
implemented in initially envisioned form.
Remove the defines, and open up the possibility of
reclaiming the bits for other uses.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
NeilBrown says:
====================
Two clean-ups for rhashtable.
These two patches make small improvements to
rhashtable, but are otherwise unrelated.
Thanks to Herbert, Miguel, and Paul for the review.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The pattern set by list.h is that for_each..continue()
iterators start at the next entry after the given one,
while for_each..from() iterators start at the given
entry.
The rht_for_each*continue() iterators are documented as though the
start at the 'next' entry, but actually start at the given entry,
and they are used expecting that behaviour.
So fix the documentation and change the names to *from for consistency
with list.h
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rhashtable_try_insert() currently holds a lock on the bucket in
the first table, while also locking buckets in subsequent tables.
This is unnecessary and looks like a hold-over from some earlier
version of the implementation.
As insert and remove always lock a bucket in each table in turn, and
as insert only inserts in the final table, there cannot be any races
that are not covered by simply locking a bucket in each table in turn.
When an insert call reaches that last table it can be sure that there
is no matchinf entry in any other table as it has searched them all, and
insertion never happens anywhere but in the last table. The fact that
code tests for the existence of future_tbl while holding a lock on
the relevant bucket ensures that two threads inserting the same key
will make compatible decisions about which is the "last" table.
This simplifies the code and allows the ->rehash field to be
discarded.
We still need a way to ensure that a dead bucket_table is never
re-linked by rhashtable_walk_stop(). This can be achieved by calling
call_rcu() inside the locked region, and checking with
rcu_head_after_call_rcu() in rhashtable_walk_stop() to see if the
bucket table is empty and dead.
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli says:
====================
net: phy: Move Omega PHY entry to Cygnus PHY driver
In order to pave the way for adding some specific Omega PHY features
that may not be desirable on other products covered by the bcm7xxx PHY
driver, split the Omega PHY entry into the Cygnus PHY driver such that
the PHY drivers are reflective of product lines/business units
maintaining them within Broadcom.
No functional changes intended.
====================
Acked-by: Arun Parameswaran <arun.parameswaran@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cygnus and Omega are part of the same business unit and product line, it
makes sense to group PHY entries by products such that a platform can
select only the drivers that it needs. Bring all the functionality that
the BCM7XXX_28NM_GPHY() macro hides for us and remove the Omega PHY
entry from bcm7xxx.c.
As an added bonus, we now have a proper mdio_device_id entry to permit
auto-loading.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Scott Branden <scott.branden@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Omega PHY entry was added to bcm7xxx.c out of convenience and this
breaks the one driver per product line paradigm that was applied up
until now. Since the AFE initialization is shared between Omega and
BCM7xxx move the relevant functions to bcm-phy-lib.[ch]. No functional
changes introduced.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Scott Branden <scott.branden@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Get rid of some obsolete gc-related documentation and macros that were
missed in commit 5b7c9a8ff8 ("net: remove dst gc related code").
CC: Wei Wang <weiwan@google.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli says:
====================
net: broadcom: Remove print of base address
Some broadcom MDIO/switch/Ethernet MAC drivers insist on printing the
base register virtual address which has little value.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit ad67b74d24 ("printk: hash addresses printed with %p")
pointers are being hashed when printed. Displaying the virtual memory at
bootup time is not helpful, especially given we use a dev_info() which
already displays the platform device's address.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit ad67b74d24 ("printk: hash addresses printed with %p")
pointers are being hashed when printed. Displaying the virtual memory at
bootup time is not helpful, we use a dev_info() print which already
displays the platform device's address.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit ad67b74d24 ("printk: hash addresses printed with %p")
pointers are being hashed when printed. Displaying the virtual memory at
bootup time is not helpful, especially given we use a dev_info() which
already displays the platform device's address.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
net and null_fallback are redundant. Remove null_fallback in favor of
!net check.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
fib_trie implementation calls synchronize_rcu when a certain amount of
pages are dirty from freed entries. The number of pages was determined
experimentally in 2009 (commit c3059477fc).
At the current setting, synchronize_rcu is called often -- 51 times in a
second in one test with an average of an 8 msec delay adding a fib entry.
The total impact is a lot of slow down modifying the fib. This is seen
in the output of 'time' - the difference between real time and sys+user.
For example, using 720,022 single path routes and 'ip -batch'[1]:
$ time ./ip -batch ipv4/routes-1-hops
real 0m14.214s
user 0m2.513s
sys 0m6.783s
So roughly 35% of the actual time to install the routes is from the ip
command getting scheduled out, most notably due to synchronize_rcu (this
is observed using 'perf sched timehist').
This patch makes the amount of dirty memory configurable between 64k where
the synchronize_rcu is called often (small, low end systems that are memory
sensitive) to 64M where synchronize_rcu is called rarely during a large
FIB change (for high end systems with lots of memory). The default is 512kB
which corresponds to the current setting of 128 pages with a 4kB page size.
As an example, at 16MB the worst interval shows 4 calls to synchronize_rcu
in a second blocking for up to 30 msec in a single instance, and a total
of almost 100 msec across the 4 calls in the second. The trade off is
allowing FIB entries to consume more memory in a given time window but
but with much better fib insertion rates (~30% increase in prefixes/sec).
With this patch and net.ipv4.fib_sync_mem set to 16MB, the same batch
file runs in:
$ time ./ip -batch ipv4/routes-1-hops
real 0m9.692s
user 0m2.491s
sys 0m6.769s
So the dead time is reduced to about 1/2 second or <5% of the real time.
[1] 'ip' modified to not request ACK messages which improves route
insertion times by about 20%
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit f2780d6d74 "tun: Add ioctl() SIOCGSKNS cmd to allow
obtaining net ns of tun device" it was missed that tun may change
its net ns, while net ns of socket remains the same as it was
created initially. SIOCGSKNS returns net ns of socket, so it is
not suitable for obtaining net ns of device.
We may have two tun devices with the same names in two net ns,
and in this case it's not possible to determ, which of them
fd refers to (TUNGETIFF will return the same name).
This patch adds new ioctl() cmd for obtaining net ns of a device.
Reported-by: Harald Albrecht <harald.albrecht@gmx.net>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern says:
====================
ipv6: Change addrconf_f6i_alloc to use ip6_route_info_create
addrconf_f6i_alloc is the last caller of fib6_info_alloc besides
ip6_route_info_create. There really is no good reason for it do
its own fib6_info initialization, so convert it to call
ip6_route_info_create.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Change addrconf_f6i_alloc to generate a fib6_config and call
ip6_route_info_create. addrconf_f6i_alloc is the last caller to
fib6_info_alloc besides ip6_route_info_create, and there is no
reason for it to do its own initialization on a fib6_info.
Host routes need to be created even if the device is down, so add a
new flag, fc_ignore_dev_down, to fib6_config and update fib6_nh_init
to not error out if device is not up.
Notes on the conversion:
- ip_fib_metrics_init is the same as fib6_config has fc_mx set to NULL
and fc_mx_len set to 0
- dst_nocount is handled by the RTF_ADDRCONF flag
- dst_host is handled by fc_dst_len = 128
nh_gw does not get set after the conversion to ip6_route_info_create
but it should not be set in addrconf_f6i_alloc since this is a host
route not a gateway route.
Everything else is a straight forward map between fib6_info and
fib6_config.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6_route_info_create is a low level function for ensuring fc_metric is
set. Move the check and default setting to the 2 locations that do not
already set fc_metric before calling ip6_route_info_create. This is
required for the next patch which moves addrconf allocations to
ip6_route_info_create and want the metric for host routes to be 0.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
To free the skb in normal course of processing, consume_skb() should be
used. Only for failure paths, skb_free() is intended to be used.
https://www.kernel.org/doc/htmldocs/networking/API-consume-skb.html
Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
skb free-ed in:
1/ condition 1: tipc_sk_filter_rcv -> tipc_sk_proto_rcv
2/ condition 2: tipc_sk_filter_rcv -> tipc_group_filter_msg
This leads to a "use-after-free" access in the next condition.
We fix this by intializing the variable at declaration, then it is safe
to check this variable to continue processing if condition matches.
syzbot report:
==================================================================
BUG: KASAN: use-after-free in tipc_sk_filter_rcv+0x2166/0x34f0
net/tipc/socket.c:2167
Read of size 4 at addr ffff88808ea58534 by task kworker/u4:0/7
CPU: 0 PID: 7 Comm: kworker/u4:0 Not tainted 5.0.0+ #61
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/01/2011
Workqueue: tipc_send tipc_conn_send_work
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x172/0x1f0 lib/dump_stack.c:113
print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
__asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:131
tipc_sk_filter_rcv+0x2166/0x34f0 net/tipc/socket.c:2167
tipc_sk_enqueue net/tipc/socket.c:2254 [inline]
tipc_sk_rcv+0xc45/0x25a0 net/tipc/socket.c:2305
tipc_topsrv_kern_evt+0x3b7/0x580 net/tipc/topsrv.c:610
tipc_conn_send_to_sock+0x43e/0x5f0 net/tipc/topsrv.c:283
tipc_conn_send_work+0x65/0x80 net/tipc/topsrv.c:303
process_one_work+0x98e/0x1790 kernel/workqueue.c:2269
worker_thread+0x98/0xe40 kernel/workqueue.c:2415
kthread+0x357/0x430 kernel/kthread.c:253
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
Reported-by: syzbot+e863893591cc7a622e40@syzkaller.appspotmail.com
Fixes: c55c8eda ("tipc: smooth change between replicast and broadcast")
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
In addition to icmp_echo_ignore_multicast, there is a need to also
prevent responding to pings to anycast addresses for security.
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix sparse warnings:
drivers/isdn/i4l/isdn_ppp.c:1891:16: warning:
symbol 'isdn_ppp_mp_discard' was not declared. Should it be static?
drivers/isdn/i4l/isdn_ppp.c:1903:6: warning:
symbol 'isdn_ppp_mp_reassembly' was not declared. Should it be static?
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix sparse warning:
drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c:414:6:
warning: symbol 'hclge_destroy_cmd_queue' was not declared. Should it be static?
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni says:
====================
net: refactor ndo_select_queue()
Currently, on most devices implementing ndo_select_queue(), we get 2
indirect calls per xmit packet, at least in some scenarios.
We can avoid one of such indirect calls refactoring the ndo_select_queue()
usage so that we don't need anymore the 'fallback' argument.
The first patch renames a helper used later as a public API, the second one
changes the af packet implementation so that it uses the common infrastructure
to select the xmit queue, and the second patch drops the now unneeded argument
from ndo_select_queue().
Alternatively we could use the INDIRECT_CALL_WRAPPER infrastructure to avoid
the fallback indirect call in the common case, but this solution allows also
for some code cleanup.
v1 -> v2:
- renamed select queue helpers, as per Eric's and David's suggestions
====================
Signed-off-by: David S. Miller <davem@davemloft.net>