Introduce instrument_copy_from_user_before() and
instrument_copy_from_user_after() hooks to be invoked before and after the
call to copy_from_user().
KASAN and KCSAN will be only using instrument_copy_from_user_before(), but
for KMSAN we'll need to insert code after copy_from_user().
Link: https://lkml.kernel.org/r/20220915150417.722975-4-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Some users (currently only KMSAN) may want to use spare bits in
depot_stack_handle_t. Let them do so by adding @extra_bits to
__stack_depot_save() to store arbitrary flags, and providing
stack_depot_get_extra_bits() to retrieve those flags.
Also adapt KASAN to the new prototype by passing extra_bits=0, as KASAN
does not intend to store additional information in the stack handle.
Link: https://lkml.kernel.org/r/20220915150417.722975-3-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
HMM selftests use an in-kernel pseudo device to emulate device memory.
The pseudo device registers a major device range for two or four pseudo
device instances. User space has a script that reads /proc/devices in
order to find the assigned major number, and sends that to mknod(1), once
for each node.
Change this to properly use cdev and struct device APIs.
Delete the /proc/devices parsing from the user-space test script, now that
it is unnecessary.
Also, delete an unused field in struct dmirror_device: devmem.
Link: https://lkml.kernel.org/r/20220826050631.25771-1-mpenttil@redhat.com
Signed-off-by: Mika Penttilä <mpenttil@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a new use-after-free test that checks that KASAN detects
use-after-free when another object was allocated in the same slot.
This test is mainly relevant for the tag-based modes, which do not use
quarantine.
Once [1] is resolved, this test can be extended to check that the stack
traces in the report point to the proper kmalloc/kfree calls.
[1] https://bugzilla.kernel.org/show_bug.cgi?id=212203
Link: https://lkml.kernel.org/r/0659cfa15809dd38faa02bc0a59d0b5dbbd81211.1662411800.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Peter Collingbourne <pcc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Drop CONFIG_KASAN_TAGS_IDENTIFY and related code to simplify making
changes to the reporting code.
The dropped functionality will be restored in the following patches in
this series.
Link: https://lkml.kernel.org/r/4c66ba98eb237e9ed9312c19d423bbcf4ecf88f8.1662411799.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Peter Collingbourne <pcc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Christophe Leroy reported a ~80ms latency spike
happening at first TCP connect() time.
This is because __inet_hash_connect() uses get_random_once()
to populate a perturbation table which became quite big
after commit 4c2c8f03a5 ("tcp: increase source port perturb table to 2^16")
get_random_once() uses DO_ONCE(), which block hard irqs for the duration
of the operation.
This patch adds DO_ONCE_SLOW() which uses a mutex instead of a spinlock
for operations where we prefer to stay in process context.
Then __inet_hash_connect() can use get_random_slow_once()
to populate its perturbation table.
Fixes: 4c2c8f03a5 ("tcp: increase source port perturb table to 2^16")
Fixes: 190cc82489 ("tcp: change source port randomizarion at connect() time")
Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Link: https://lore.kernel.org/netdev/CANn89iLAEYBaoYajy0Y9UmGFff5GPxDUoG-ErVB2jDdRNQ5Tug@mail.gmail.com/T/#t
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willy Tarreau <w@1wt.eu>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
With CONFIG_ZSTD_COMPRESS=m and CONFIG_ZSTD_DECOMPRESS=y we end up in
a situation when files from lib/zstd/common/ are compiled once to be
linked later for ZSTD_DECOMPRESS (build-in) and ZSTD_COMPRESS (module)
even though CFLAGS are different for builtins and modules.
So far somehow this was not a problem but enabling LLVM LTO exposes
the problem as:
ld.lld: error: linking module flags 'Code Model': IDs have conflicting values in 'lib/built-in.a(zstd_common.o at 5868)' and 'ld-temp.o'
This particular conflict is caused by KBUILD_CFLAGS=-mcmodel=medium vs.
KBUILD_CFLAGS_MODULE=-mcmodel=large , modules use the large model on
POWERPC as explained at
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/Makefile?h=v5.18-rc4#n127
but the current use of common files is wrong anyway.
This works around the issue by introducing a zstd_common module with
shared code.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
The helper is better optimized for the worst case: in case of empty
cpumask, current code traverses 2 * size:
next = cpumask_next_and(prev, src1p, src2p);
if (next >= nr_cpu_ids)
next = cpumask_first_and(src1p, src2p);
At bitmap level we can stop earlier after checking 'size + offset' bits.
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Replace URL with an updated path to the full Documentation page
Signed-off-by: Tales Aparecida <tales.aparecida@gmail.com>
Reviewed-by: David Gow <davidgow@google.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Replace URL with an updated path to the full Documentation page
Signed-off-by: Tales Aparecida <tales.aparecida@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Gow <davidgow@google.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
This patch adds the kunit.enable module parameter that will need to be
set to true in addition to KUNIT being enabled for KUnit tests to run.
The default value is true giving backwards compatibility. However, for
the production+testing use case the new config option
KUNIT_DEFAULT_ENABLED can be set to N requiring the tester to opt-in
by passing kunit.enable=1 to the kernel.
Signed-off-by: Joe Fradley <joefradley@google.com>
Reviewed-by: David Gow <davidgow@google.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Commit 4acb83417c ("sbitmap: fix batched wait_cnt accounting")
is a big improvement: without it, I had to revert to before commit
040b83fcec ("sbitmap: fix possible io hung due to lost wakeup")
to avoid the high system time and freezes which that had introduced.
Now okay on the NVME laptop, but 4acb83417c is a disaster for heavy
swapping (kernel builds in low memory) on another: soon locking up in
sbitmap_queue_wake_up() (into which __sbq_wake_up() is inlined), cycling
around with waitqueue_active() but wait_cnt 0 . Here is a backtrace,
showing the common pattern of outer sbitmap_queue_wake_up() interrupted
before setting wait_cnt 0 back to wake_batch (in some cases other CPUs
are idle, in other cases they're spinning for a lock in dd_bio_merge()):
sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
__blk_mq_free_request < blk_mq_free_request < __blk_mq_end_request <
scsi_end_request < scsi_io_completion < scsi_finish_command <
scsi_complete < blk_complete_reqs < blk_done_softirq < __do_softirq <
__irq_exit_rcu < irq_exit_rcu < common_interrupt < asm_common_interrupt <
_raw_spin_unlock_irqrestore < __wake_up_common_lock < __wake_up <
sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
__blk_mq_free_request < blk_mq_free_request < dd_bio_merge <
blk_mq_sched_bio_merge < blk_mq_attempt_bio_merge < blk_mq_submit_bio <
__submit_bio < submit_bio_noacct_nocheck < submit_bio_noacct <
submit_bio < __swap_writepage < swap_writepage < pageout <
shrink_folio_list < evict_folios < lru_gen_shrink_lruvec <
shrink_lruvec < shrink_node < do_try_to_free_pages < try_to_free_pages <
__alloc_pages_slowpath < __alloc_pages < folio_alloc < vma_alloc_folio <
do_anonymous_page < __handle_mm_fault < handle_mm_fault <
do_user_addr_fault < exc_page_fault < asm_exc_page_fault
See how the process-context sbitmap_queue_wake_up() has been interrupted,
after bringing wait_cnt down to 0 (and in this example, after doing its
wakeups), before advancing wake_index and refilling wake_cnt: an
interrupt-context sbitmap_queue_wake_up() of the same sbq gets stuck.
I have almost no grasp of all the possible sbitmap races, and their
consequences: but __sbq_wake_up() can do nothing useful while wait_cnt 0,
so it is better if sbq_wake_ptr() skips on to the next ws in that case:
which fixes the lockup and shows no adverse consequence for me.
The check for wait_cnt being 0 is obviously racy, and ultimately can lead
to lost wakeups: for example, when there is only a single waitqueue with
waiters. However, lost wakeups are unlikely to matter in these cases,
and a proper fix requires redesign (and benchmarking) of the batched
wakeup code: so let's plug the hole with this bandaid for now.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/9c2038a7-cdc5-5ee-854c-fbc6168bf16@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The printk code invokes vnsprintf in order to compute the complete
string before adding it into its buffer. This happens in an IRQ-off
region which leads to a warning on PREEMPT_RT in the random code if the
format strings contains a %p for pointer printing. This happens because
the random core acquires locks which become sleeping locks on PREEMPT_RT
which must not be acquired with disabled interrupts and or preemption
disabled.
By default the pointers are hashed which requires a random value on the
first invocation (either by printk or another user which comes first.
One could argue that there is no need for printk to disable interrupts
during the vsprintf() invocation which would fix the just mentioned
problem. However printk itself can be invoked in a context with
disabled interrupts which would lead to the very same problem.
Move the initialization of ptr_key into a worker and schedule it from
subsys_initcall(). This happens early but after the workqueue subsystem
is ready. Use get_random_bytes() to retrieve the random value if the RNG
core is ready, otherwise schedule a worker in two seconds and try again.
Another advantage is that it removes a lock from the vsprintf() code path.
It prevents a possible deadlock when printk("%p", ptr) is called under
the lock taken in get_random_bytes().
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
[pmladek@suse.com: Added a note about the it prevented a possible deadlock in printk().]
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220927104912.622645-3-bigeasy@linutronix.de
Using static_branch_likely() to signal that ptr_key has been filled is a
bit much given that it is not a fast path.
Replace static_branch_likely() with bool for condition and a memory
barrier for ptr_key.
Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220927104912.622645-2-bigeasy@linutronix.de
Having most of the new files in place, we now enable Rust support
in the build system, including `Kconfig` entries related to Rust,
the Rust configuration printer and a few other bits.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Co-developed-by: Alex Gaynor <alex.gaynor@gmail.com>
Signed-off-by: Alex Gaynor <alex.gaynor@gmail.com>
Co-developed-by: Finn Behrens <me@kloenk.de>
Signed-off-by: Finn Behrens <me@kloenk.de>
Co-developed-by: Adam Bratschi-Kaye <ark.email@gmail.com>
Signed-off-by: Adam Bratschi-Kaye <ark.email@gmail.com>
Co-developed-by: Wedson Almeida Filho <wedsonaf@google.com>
Signed-off-by: Wedson Almeida Filho <wedsonaf@google.com>
Co-developed-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Co-developed-by: Sven Van Asbroeck <thesven73@gmail.com>
Signed-off-by: Sven Van Asbroeck <thesven73@gmail.com>
Co-developed-by: Gary Guo <gary@garyguo.net>
Signed-off-by: Gary Guo <gary@garyguo.net>
Co-developed-by: Boris-Chengbiao Zhou <bobo1239@web.de>
Signed-off-by: Boris-Chengbiao Zhou <bobo1239@web.de>
Co-developed-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Co-developed-by: Douglas Su <d0u9.su@outlook.com>
Signed-off-by: Douglas Su <d0u9.su@outlook.com>
Co-developed-by: Dariusz Sosnowski <dsosnowski@dsosnowski.pl>
Signed-off-by: Dariusz Sosnowski <dsosnowski@dsosnowski.pl>
Co-developed-by: Antonio Terceiro <antonio.terceiro@linaro.org>
Signed-off-by: Antonio Terceiro <antonio.terceiro@linaro.org>
Co-developed-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Co-developed-by: Björn Roy Baron <bjorn3_gh@protonmail.com>
Signed-off-by: Björn Roy Baron <bjorn3_gh@protonmail.com>
Co-developed-by: Martin Rodriguez Reboredo <yakoyoku@gmail.com>
Signed-off-by: Martin Rodriguez Reboredo <yakoyoku@gmail.com>
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
This patch adds a format specifier `%pA` to `vsprintf` which formats
a pointer as `core::fmt::Arguments`. Doing so allows us to directly
format to the internal buffer of `printf`, so we do not have to use
a temporary buffer on the stack to pre-assemble the message on
the Rust side.
This specifier is intended only to be used from Rust and not for C, so
`checkpatch.pl` is intentionally unchanged to catch any misuse.
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Co-developed-by: Alex Gaynor <alex.gaynor@gmail.com>
Signed-off-by: Alex Gaynor <alex.gaynor@gmail.com>
Co-developed-by: Wedson Almeida Filho <wedsonaf@google.com>
Signed-off-by: Wedson Almeida Filho <wedsonaf@google.com>
Signed-off-by: Gary Guo <gary@garyguo.net>
Co-developed-by: Miguel Ojeda <ojeda@kernel.org>
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
While discussing early DMA pool pre-allocation failure with Christoph [1]
I have realized that the allocation failure warning is rather noisy for
constrained allocations like GFP_DMA{32}. Those zones are usually not
populated on all nodes very often as their memory ranges are constrained.
This is an attempt to reduce the ballast that doesn't provide any relevant
information for those allocation failures investigation. Please note that
I have only compile tested it (in my default config setup) and I am
throwing it mostly to see what people think about it.
[1] http://lkml.kernel.org/r/20220817060647.1032426-1-hch@lst.de
[mhocko@suse.com: update]
Link: https://lkml.kernel.org/r/Yw29bmJTIkKogTiW@dhcp22.suse.cz
[mhocko@suse.com: fix build]
[akpm@linux-foundation.org: fix it for mapletree]
[akpm@linux-foundation.org: update it for Michal's update]
[mhocko@suse.com: fix arch/powerpc/xmon/xmon.c]
Link: https://lkml.kernel.org/r/Ywh3C4dKB9B93jIy@dhcp22.suse.cz
[akpm@linux-foundation.org: fix arch/sparc/kernel/setup_32.c]
Link: https://lkml.kernel.org/r/YwScVmVofIZkopkF@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
By using the maple tree and the maple tree state, the vmacache is no
longer beneficial and is complicating the VMA code. Remove the vmacache
to reduce the work in keeping it up to date and code complexity.
Link: https://lkml.kernel.org/r/20220906194824.2110408-26-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This is a test suite that uses the radix test infrastructure. It has been
split into its own commit to allow for easier review of the maple tree
code.
The testing includes:
- Allocation of nodes
- gfp flag allocation checks
- Expansion & contraction of tree
- preallocation checks
- tree navigation by next/prev
- tree navigation by iterators (mas_for_each, etc)
- Number of nodes for a given number of entries
- Generic tree construction tests
- Addition and removal of entries in forward and reverse numerical indexes
- gap searching both forward and reverse
- Combining gaps by overwriting entries in different ways
- splitting right-most node
- splitting left-most node
- overwriting multiple slots
- overwriting across different levels of the tree
- overwriting the middle of a tree
- causing a 3-way split up to the root by overwriting the last slot and
first slot of different nodes and spanning different levels
- RCU stress testing of the tree with threads
- Duplication of the tree by entry count
- Tests which were generated by fuzzers have been added.
- A large number of tests which come from recording crashing in a VM and
reconstructing the tree (see check_erase2_set())
Link: https://lkml.kernel.org/r/20220906194824.2110408-8-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Introducing the Maple Tree"
The maple tree is an RCU-safe range based B-tree designed to use modern
processor cache efficiently. There are a number of places in the kernel
that a non-overlapping range-based tree would be beneficial, especially
one with a simple interface. If you use an rbtree with other data
structures to improve performance or an interval tree to track
non-overlapping ranges, then this is for you.
The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes. With the increased branching factor, it is significantly shorter
than the rbtree so it has fewer cache misses. The removal of the linked
list between subsequent entries also reduces the cache misses and the need
to pull in the previous and next VMA during many tree alterations.
The first user that is covered in this patch set is the vm_area_struct,
where three data structures are replaced by the maple tree: the augmented
rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The
long term goal is to reduce or remove the mmap_lock contention.
The plan is to get to the point where we use the maple tree in RCU mode.
Readers will not block for writers. A single write operation will be
allowed at a time. A reader re-walks if stale data is encountered. VMAs
would be RCU enabled and this mode would be entered once multiple tasks
are using the mm_struct.
Davidlor said
: Yes I like the maple tree, and at this stage I don't think we can ask for
: more from this series wrt the MM - albeit there seems to still be some
: folks reporting breakage. Fundamentally I see Liam's work to (re)move
: complexity out of the MM (not to say that the actual maple tree is not
: complex) by consolidating the three complimentary data structures very
: much worth it considering performance does not take a hit. This was very
: much a turn off with the range locking approach, which worst case scenario
: incurred in prohibitive overhead. Also as Liam and Matthew have
: mentioned, RCU opens up a lot of nice performance opportunities, and in
: addition academia[1] has shown outstanding scalability of address spaces
: with the foundation of replacing the locked rbtree with RCU aware trees.
A similar work has been discovered in the academic press
https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf
Sheer coincidence. We designed our tree with the intention of solving the
hardest problem first. Upon settling on a b-tree variant and a rough
outline, we researched ranged based b-trees and RCU b-trees and did find
that article. So it was nice to find reassurances that we were on the
right path, but our design choice of using ranges made that paper unusable
for us.
This patch (of 70):
The maple tree is an RCU-safe range based B-tree designed to use modern
processor cache efficiently. There are a number of places in the kernel
that a non-overlapping range-based tree would be beneficial, especially
one with a simple interface. If you use an rbtree with other data
structures to improve performance or an interval tree to track
non-overlapping ranges, then this is for you.
The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes. With the increased branching factor, it is significantly shorter
than the rbtree so it has fewer cache misses. The removal of the linked
list between subsequent entries also reduces the cache misses and the need
to pull in the previous and next VMA during many tree alterations.
The first user that is covered in this patch set is the vm_area_struct,
where three data structures are replaced by the maple tree: the augmented
rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The
long term goal is to reduce or remove the mmap_lock contention.
The plan is to get to the point where we use the maple tree in RCU mode.
Readers will not block for writers. A single write operation will be
allowed at a time. A reader re-walks if stale data is encountered. VMAs
would be RCU enabled and this mode would be entered once multiple tasks
are using the mm_struct.
There is additional BUG_ON() calls added within the tree, most of which
are in debug code. These will be replaced with a WARN_ON() call in the
future. There is also additional BUG_ON() calls within the code which
will also be reduced in number at a later date. These exist to catch
things such as out-of-range accesses which would crash anyways.
Link: https://lkml.kernel.org/r/20220906194824.2110408-1-Liam.Howlett@oracle.com
Link: https://lkml.kernel.org/r/20220906194824.2110408-2-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: David Howells <dhowells@redhat.com>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add cpumask_nth_{,and,andnot} as wrappers around corresponding
find functions, and use it in cpumask_local_spread().
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Kernel lacks for a function that searches for Nth bit in a bitmap.
Usually people do it like this:
for_each_set_bit(bit, mask, size)
if (n-- == 0)
return bit;
We can do it more efficiently, if we:
1. find a word containing Nth bit, using hweight(); and
2. find the bit, using a helper fns(), that works similarly to
__ffs() and ffz().
fns() is implemented as a simple loop. For x86_64, there's PDEP instruction
to do that: ret = clz(pdep(1 << idx, num)). However, for large bitmaps the
most of improvement comes from using hweight(), so I kept fns() simple.
New find_nth_bit() is ~70 times faster on x86_64/kvm in find_bit benchmark:
find_nth_bit: 7154190 ns, 16411 iterations
for_each_bit: 505493126 ns, 16315 iterations
With all that, a family of 3 new functions is added, and used where
appropriate in the following patches.
Signed-off-by: Yury Norov <yury.norov@gmail.com>
The function calculates Hamming weight of (bitmap1 & bitmap2). Now we
have to do like this:
tmp = bitmap_alloc(nbits);
bitmap_and(tmp, map1, map2, nbits);
weight = bitmap_weight(tmp, nbits);
bitmap_free(tmp);
This requires additional memory, adds pressure on alloc subsystem, and
way less cache-friendly than just:
weight = bitmap_weight_and(map1, map2, nbits);
The following patches apply it for cpumask functions.
Signed-off-by: Yury Norov <yury.norov@gmail.com>
__bitmap_weight() is not to be used directly in the kernel code because
it's a helper for bitmap_weight(). Switch everything to bitmap_weight().
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Alexey reported that the fraction of unknown filename instances in
kallsyms grew from ~0.3% to ~10% recently; Bill and Greg tracked it down
to assembler defined symbols, which regressed as a result of:
commit b8a9092330 ("Kbuild: do not emit debug info for assembly with LLVM_IAS=1")
In that commit, I allude to restoring debug info for assembler defined
symbols in a follow up patch, but it seems I forgot to do so in
commit a66049e2cf ("Kbuild: make DWARF version a choice")
Link: https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;h=31bf18645d98b4d3d7357353be840e320649a67d
Fixes: b8a9092330 ("Kbuild: do not emit debug info for assembly with LLVM_IAS=1")
Reported-by: Alexey Alexandrov <aalexand@google.com>
Reported-by: Bill Wendling <morbo@google.com>
Reported-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Suggested-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
sg_alloc_table_chained() is called by several drivers, but if it is
called before sg_pool_init(), it results in a NULL pointer dereference
in sg_pool_alloc().
Since commit 9b1d6c8950 ("lib: scatterlist: move SG pool code from
SCSI driver to lib/sg_pool.c"), we rely on module_init(sg_pool_init)
is invoked before other module_init calls but this assumption is
fragile.
I slightly changed the link order while refactoring Kbuild, then
uncovered this issue. I should keep the current link order, but
depending on a specific call order among module_init is so fragile.
We usually define the init order by specifying *_initcall correctly,
or delay the driver probing by returning -EPROBE_DEFER.
Change module_initcall() to subsys_initcall(), and also delete the
pointless module_exit() because lib/sg_pool.c is always compiled as
built-in. (CONFIG_SG_POOL is bool)
Link: https://lore.kernel.org/all/20220921043946.GA1355561@roeck-us.net/
Link: https://lore.kernel.org/all/8e70837d-d859-dfb2-bf7f-83f8b31467bc@samsung.com/
Fixes: 9b1d6c8950 ("lib: scatterlist: move SG pool code from SCSI driver to lib/sg_pool.c")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Over the past couple years, the function _find_next_bit() was extended
with parameters that modify its behavior to implement and- zero- and le-
flavors. The parameters are passed at compile time, but current design
prevents a compiler from optimizing out the conditionals.
As find_next_bit() API grows, I expect that more parameters will be added.
Current design would require more conditional code in _find_next_bit(),
which would bloat the helper even more and make it barely readable.
This patch replaces _find_next_bit() with a macro FIND_NEXT_BIT, and adds
a set of wrappers, so that the compile-time optimizations become possible.
The common logic is moved to the new macro, and all flavors may be
generated by providing a FETCH macro parameter, like in this example:
#define FIND_NEXT_BIT(FETCH, MUNGE, size, start) ...
find_next_xornot_and_bit(addr1, addr2, addr3, size, start)
{
return FIND_NEXT_BIT(addr1[idx] ^ ~addr2[idx] & addr3[idx],
/* nop */, size, start);
}
The FETCH may be of any complexity, as soon as it only refers the bitmap(s)
and an iterator idx.
MUNGE is here to support _le code generation for BE builds. May be
empty.
I ran find_bit_benchmark 16 times on top of 6.0-rc2 and 16 times on top
of 6.0-rc2 + this series. The results for kvm/x86_64 are:
v6.0-rc2 Optimized Difference Z-score
Random dense bitmap ns ns ns %
find_next_bit: 787735 670546 117189 14.9 3.97
find_next_zero_bit: 777492 664208 113284 14.6 10.51
find_last_bit: 830925 687573 143352 17.3 2.35
find_first_bit: 3874366 3306635 567731 14.7 1.84
find_first_and_bit: 40677125 37739887 2937238 7.2 1.36
find_next_and_bit: 347865 304456 43409 12.5 1.35
Random sparse bitmap
find_next_bit: 19816 14021 5795 29.2 6.10
find_next_zero_bit: 1318901 1223794 95107 7.2 1.41
find_last_bit: 14573 13514 1059 7.3 6.92
find_first_bit: 1313321 1249024 64297 4.9 1.53
find_first_and_bit: 8921 8098 823 9.2 4.56
find_next_and_bit: 9796 7176 2620 26.7 5.39
Where the statistics is significant (z-score > 3), the improvement
is ~15%.
According to the bloat-o-meter, the Image size is 10-11K less:
x86_64/defconfig:
add/remove: 32/14 grow/shrink: 61/782 up/down: 6344/-16521 (-10177)
arm64/defconfig:
add/remove: 3/2 grow/shrink: 50/714 up/down: 608/-11556 (-10948)
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
find_first_zero_bit_le() is an alias to find_next_zero_bit_le(),
despite that 'next' is known to be slower than 'first' version.
Now that we have common FIND_FIRST_BIT() macro helper, it's trivial
to implement find_first_zero_bit_le() as a real function.
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Now that we have many flavors of find_first_bit(), and expect even more,
it's better to have one macro that generates optimal code for all and makes
maintaining of slightly different functions simpler.
The logic common to all versions is moved to the new macro, and all the
flavors are generated by providing an FETCH macro-parameter, like
in this example:
#define FIND_FIRST_BIT(FETCH, MUNGE, size) ...
find_first_ornot_and_bit(addr1, addr2, addr3, size)
{
return FIND_FIRST_BIT(addr1[idx] | ~addr2[idx] & addr3[idx], /* nop */, size);
}
The FETCH may be of any complexity, as soon as it only refers
the bitmap(s) and an iterator idx.
MUNGE is here to support _le code generation for BE builds. May be
empty.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
The size of cpumasks is hard-limited by compile-time parameter NR_CPUS,
but defined at boot-time when kernel parses ACPI/DT tables, and stored in
nr_cpu_ids. In many practical cases, number of CPUs for a target is known
at compile time, and can be provided with NR_CPUS.
In that case, compiler may be instructed to rely on NR_CPUS as on actual
number of CPUs, not an upper limit. It allows to optimize many cpumask
routines and significantly shrink size of the kernel image.
This patch adds FORCE_NR_CPUS option to teach the compiler to rely on
NR_CPUS and enable corresponding optimizations.
If FORCE_NR_CPUS=y, kernel will not set nr_cpu_ids at boot, but only check
that the actual number of possible CPUs is equal to NR_CPUS, and WARN if
that doesn't hold.
The new option is especially useful in embedded applications because
kernel configurations are unique for each SoC, the number of CPUs is
constant and known well, and memory limitations are typically harder.
For my 4-CPU ARM64 build with NR_CPUS=4, FORCE_NR_CPUS=y saves 46KB:
add/remove: 3/4 grow/shrink: 46/729 up/down: 652/-46952 (-46300)
Signed-off-by: Yury Norov <yury.norov@gmail.com>
The seqcount fprop_global::sequence is not associated with a lock. The
write section (fprop_new_period()) is invoked from a timer and since the
softirq is preemptible on PREEMPT_RT it is possible to preempt the write
section which is not desited.
Disable preemption around the write section on PREEMPT_RT.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220825164131.402717-8-bigeasy@linutronix.de
Some places in the VM code expect interrupts disabled, which is a valid
expectation on non-PREEMPT_RT kernels, but does not hold on RT kernels in
some places because the RT spinlock substitution does not disable
interrupts.
To avoid sprinkling CONFIG_PREEMPT_RT conditionals into those places,
provide VM_WARN_ON_IRQS_ENABLED() which is only enabled when VM_DEBUG=y and
PREEMPT_RT=n.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Link: https://lore.kernel.org/r/20220825164131.402717-5-bigeasy@linutronix.de
A much better "unknown size" string pointer is available directly from
struct test, so use that instead of a global that isn't shared with
modules.
Reported-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/lkml/YyCOHOchVuE/E7vS@dev-arch.thelio-3990X
Fixes: 875bfd5276 ("fortify: Add KUnit test for FORTIFY_SOURCE internals")
Cc: linux-hardening@vger.kernel.org
Build-tested-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: David Gow <davidgow@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmMeQ2keHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGYRMH+gLNHiGirGZlm2GQ
tKaZQUy7MiXuIP0hGDonDIIIAmIVhnjm9MDG8KT4W8AvEd7ukncyYqJfwWeWQPhP
4mZcf6l3Z8Ke+qiaFpXpMPCxTyWcln1ox0EoNx2g9gdPxZntaRuuaTQVljUfTiey
aVPHxve8ip3G7jDoJnuLSxESOqWxkb8v/SshBP1E5bF5BZ+cgZRqq7FNigFqxjbk
wF29K09BVOPjdgkSvY/b0/SnL5KlSdMAv+FrPcJNGivcdIPgf/qJks5cI2HRUo7o
CpKgbcLorCVyD+d+zLonJBwIy3arbmKD8JqYnfdTSIqVOUqHXWUDfeydsH32u1Gu
lPSI2Hw=
=7LTL
-----END PGP SIGNATURE-----
Merge 6.0-rc5 into driver-core-next
We need the driver core and debugfs changes in this branch.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
llist_add_batch and llist_del_first. x86 CMPXCHG instruction returns
success in ZF flag, so this change saves a compare after cmpxchg.
Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg
fails, enabling further code simplifications.
No functional change intended.
Link: https://lkml.kernel.org/r/20220712144917.4497-1-ubizjak@gmail.com
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use atomic_long_try_cmpxchg instead of
atomic_long_cmpxchg (*ptr, old, new) == old in __sbitmap_queue_get_batch.
x86 CMPXCHG instruction returns success in ZF flag, so this change
saves a compare after cmpxchg (and related move instruction in front
of cmpxchg).
Also, atomic_long_cmpxchg implicitly assigns old *ptr value to "old"
when cmpxchg fails, enabling further code simplifications, e.g.
an extra memory read can be avoided in the loop.
No functional change intended.
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20220908151200.9993-1-ubizjak@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When __sbq_wake_up() decrements wait_cnt to 0 but races with someone
else waking the waiter on the waitqueue (so the waitqueue becomes
empty), it exits without reseting wait_cnt to wake_batch number. Once
wait_cnt is 0, nobody will ever reset the wait_cnt or wake the new
waiters resulting in possible deadlocks or busyloops. Fix the problem by
making sure we reset wait_cnt even if we didn't wake up anybody in the
end.
Fixes: 040b83fcec ("sbitmap: fix possible io hung due to lost wakeup")
Reported-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220908130937.2795-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since the definition of is_signed_type() has been moved from
<linux/overflow.h> to <linux/compiler.h>, include the latter header file
instead of the former. Additionally, add a test for the type 'char'.
Cc: Isabella Basso <isabbasso@riseup.net>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220907180329.3825417-1-bvanassche@acm.org
One of the "legitimate" uses of strncpy() is copying a NUL-terminated
string into a fixed-size non-NUL-terminated character array. To avoid
the weaknesses and ambiguity of intent when using strncpy(), provide
replacement functions that explicitly distinguish between trailing
padding and not, and require the destination buffer size be discoverable
by the compiler.
For example:
struct obj {
int foo;
char small[4] __nonstring;
char big[8] __nonstring;
int bar;
};
struct obj p;
/* This will truncate to 4 chars with no trailing NUL */
strncpy(p.small, "hello", sizeof(p.small));
/* p.small contains 'h', 'e', 'l', 'l' */
/* This will NUL pad to 8 chars. */
strncpy(p.big, "hello", sizeof(p.big));
/* p.big contains 'h', 'e', 'l', 'l', 'o', '\0', '\0', '\0' */
When the "__nonstring" attributes are missing, the intent of the
programmer becomes ambiguous for whether the lack of a trailing NUL
in the p.small copy is a bug. Additionally, it's not clear whether
the trailing padding in the p.big copy is _needed_. Both cases
become unambiguous with:
strtomem(p.small, "hello");
strtomem_pad(p.big, "hello", 0);
See also https://github.com/KSPP/linux/issues/90
Expand the memcpy KUnit tests to include these functions.
Cc: Wolfram Sang <wsa+renesas@sang-engineering.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Kees Cook <keescook@chromium.org>
Under some pathological 32-bit configs, the shift overflow KUnit tests
create huge stack frames. Split up the function to avoid this,
separating by rough shift overflow cases.
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Daniel Latypov <dlatypov@google.com>
Cc: Vitor Massaru Iha <vitor@massaru.org>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/lkml/202208301850.iuv9VwA8-lkp@intel.com
Acked-by: Daniel Latypov <dlatypov@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
When the check_[op]_overflow() helpers were introduced, all arguments
were required to be the same type to make the fallback macros simpler.
However, now that the fallback macros have been removed[1], it is fine
to allow mixed types, which makes using the helpers much more useful,
as they can be used to test for type-based overflows (e.g. adding two
large ints but storing into a u8), as would be handy in the drm core[2].
Remove the restriction, and add additional self-tests that exercise
some of the mixed-type overflow cases, and double-check for accidental
macro side-effects.
[1] https://git.kernel.org/linus/4eb6bd55cfb22ffc20652732340c4962f3ac9a91
[2] https://lore.kernel.org/lkml/20220824084514.2261614-2-gwan-gyeong.mun@intel.com
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: linux-hardening@vger.kernel.org
Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com>
Reviewed-by: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com>
Tested-by: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Demonstrate use of DECLARE_DYNDBG_CLASSMAP macro, and expose them as
sysfs-nodes for testing.
For each of the 4 class-map-types:
- declare a class-map of that type,
- declare the enum corresponding to those class-names
- share _base across 0..30 range
- add a __pr_debug_cls() call for each class-name
- declare 2 sysnodes for each class-map
for 'p' flag, and future 'T' flag
These declarations create the following sysfs parameter interface:
:#> pwd
/sys/module/test_dynamic_debug/parameters
:#> ls
T_disjoint_bits T_disjoint_names T_level_names T_level_num do_prints
p_disjoint_bits p_disjoint_names p_level_names p_level_num
NOTES:
The local wrapper macro is an api candidate, but there are already too
many parameters. OTOH, maybe related enum should be in there too,
since it has _base inter-dependencies.
The T_* params control the (future) T flag on the same class'd
pr_debug callsites as their p* counterparts. Using them will fail,
until the dyndbg-trace patches are added in.
:#> echo 1 > T_disjoint
[ 28.792489] dyndbg: disjoint: 0x1 > test_dynamic_debug.T_D2
[ 28.793848] dyndbg: query 0: "class D2_CORE +T" mod:*
[ 28.795086] dyndbg: split into words: "class" "D2_CORE" "+T"
[ 28.796467] dyndbg: op='+'
[ 28.797148] dyndbg: unknown flag 'T'
[ 28.798021] dyndbg: flags parse failed
[ 28.798947] dyndbg: processed 1 queries, with 0 matches, 1 errs
[ 28.800378] dyndbg: bit_0: -22 matches on class: D2_CORE -> 0x1
[ 28.801959] dyndbg: test_dynamic_debug.T_D2: updated 0x0 -> 0x1
[ 28.803974] dyndbg: total matches: -22
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Link: https://lore.kernel.org/r/20220904214134.408619-22-jim.cromie@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add kernel_param_ops and callbacks to use a class-map to validate and
apply input to a sysfs-node, which allows users to control classes
defined in that class-map. This supports uses like:
echo 0x3 > /sys/module/drm/parameters/debug
IE add these:
- int param_set_dyndbg_classes()
- int param_get_dyndbg_classes()
- struct kernel_param_ops param_ops_dyndbg_classes
Following the model of kernel/params.c STANDARD_PARAM_DEFS, these are
non-static and exported. This might be unnecessary here.
get/set use an augmented kernel_param; the arg refs a new struct
ddebug_class_param, which contains:
- A ptr to user's state-store; a union of &ulong for drm.debug, &int
for nouveau level debug. By ref'g the client's bit-state _var, code
coordinates with existing code (like drm_debug_enabled) which uses
it, so existing/remaining calls can work unchanged. Changing
drm.debug to a ulong allows use of BIT() etc.
- FLAGS: dyndbg.flags toggled by changes to bitmap. Usually just "p".
- MAP: a pointer to struct ddebug_classes_map, which maps those
class-names to .class_ids 0..N that the module is using. This
class-map is declared & initialized by DECLARE_DYNDBG_CLASSMAP.
- map-type: 4 enums DD_CLASS_TYPE_* select 2 input forms and 2 meanings.
numeric input:
DD_CLASS_TYPE_DISJOINT_BITS integer input, independent bits. ie: drm.debug
DD_CLASS_TYPE_LEVEL_NUM integer input, 0..N levels
classnames-list (comma separated) input:
DD_CLASS_TYPE_DISJOINT_NAMES each name affects a bit, others preserved
DD_CLASS_TYPE_LEVEL_NAMES names have level meanings, like kern_levels.h
_NAMES - comma-separated classnames (with optional +-)
_NUM - numeric input, 0-N expected
_BITS - numeric input, 0x1F bitmap form expected
_DISJOINT - bits are independent
_LEVEL - (x<y) on bit-pos.
_DISJOINT treats input like a bit-vector (ala drm.debug), and sets
each bit accordingly. LEVEL is layered on top of this.
_LEVEL treats input like a bit-pos:N, then sets bits(0..N)=1, and
bits(N+1..max)=0. This applies (bit<N) semantics on top of disjoint
bits.
USAGES:
A potentially typical _DISJOINT_NAMES use:
echo +DRM_UT_CORE,+DRM_UT_KMS,-DRM_UT_DRIVER,-DRM_UT_ATOMIC \
> /sys/module/drm/parameters/debug_catnames
A naive _LEVEL_NAMES use, with one class, that sets all in the
class-map according to (x<y):
: problem seen
echo +L7 > /sys/module/test_dynamic_debug/parameters/p_level_names
: problem solved
echo -L1 > /sys/module/test_dynamic_debug/parameters/p_level_names
Note this artifact:
: this is same as prev cmd (due to +/-)
echo L0 > /sys/module/test_dynamic_debug/parameters/p_level_names
: this is "even-more" off, but same wo __pr_debug_class(L0, "..").
echo -L0 > /sys/module/test_dynamic_debug/parameters/p_level_names
A stress-test/make-work usage (kid toggling a light switch):
echo +L7,L0,L7,L0,L7,L0,L7,L0,L7,L0,L7,L0,L7 \
> /sys/module/test_dynamic_debug/parameters/p_level_names
ddebug_apply_class_bitmap(): inside-fn, works on bitmaps, receives
new-bits, finds diffs vs client-bitvector holding "current" state,
and issues exec_query to commit the adjustment.
param_set_dyndbg_classes(): interface fn, sends _NAMES to
param_set_dyndbg_classnames() and returns, falls thru to handle _BITS,
_NUM internally, and calls ddebug_apply_class_bitmap(). Finishes by
updating state.
param_set_dyndbg_classnames(): handles classnames-list in loop, calls
ddebug_apply_class_bitmap for each, then updates state.
NOTES:
_LEVEL_ is overlay on _DISJOINT_; inputs are converted to a bitmask,
by the callbacks. IOW this is possible, and possibly confusing:
echo class V3 +p > control
echo class V1 -p > control
IMO thats ok, relative verbosity is an interface property.
_LEVEL_NUM maps still need class-names, even though the names are not
usable at the sysfs interface (unlike with _NAMES style). The names
are the only way to >control the classes.
- It must have a "V0" name,
something below "V1" to turn "V1" off.
__pr_debug_cls(V0,..) is printk, don't do that.
- "class names" is required at the >control interface.
- relative levels are not enforced at >control
_LEVEL_NAMES bear +/- signs, which alters the on-bit-pos by 1. IOW,
+L2 means L0,L1,L2, and -L2 means just L0,L1. This kinda spoils the
readback fidelity, since the L0 bit gets turned on by any use of any
L*, except "-L0".
All the interface uncertainty here pertains to the _NAMES features.
Nobody has actually asked for this, so its practical (if a little
tedious) to split it out.
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Link: https://lore.kernel.org/r/20220904214134.408619-21-jim.cromie@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add module-to-class validation:
#> echo class DRM_UT_KMS +p > /proc/dynamic_debug/control
If a query has "class FOO", then ddebug_find_valid_class(), called
from ddebug_change(), requires that FOO is known to module X,
otherwize the query is skipped entirely for X. This protects each
module's class-space, other than the default:31.
The authors' choice of FOO is highly selective, giving isolation
and/or coordinated sharing of FOOs. For example, only DRM modules
should know and respond to DRM_UT_KMS.
So this, combined with module's opt-in declaration of known classes,
effectively privatizes the .class_id space for each module (or
coordinated set of modules).
Notes:
For all "class FOO" queries, ddebug_find_valid_class() is called, it
returns the map matching the query, and sets valid_class via an
*outvar).
If no "class FOO" is supplied, valid_class = _CLASS_DFLT. This
insures that legacy queries do not trample on new class'd callsites,
as they get added.
Also add a new column to control-file output, displaying non-default
class-name (when found) or the "unknown _id:", if it has not been
(correctly) declared with one of the declarator macros.
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Link: https://lore.kernel.org/r/20220904214134.408619-18-jim.cromie@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add ddebug_attach_module_classes(), call it from ddebug_add_module().
It scans the classes/section its given, finds records where the
module-name matches the module being added, and adds them to the
module's maps list. No locking here, since the record
isn't yet linked into the ddebug_tables list.
It is called indirectly from 2 sources:
- from load_module(), where it scans the module's __dyndbg_classes
section, which contains DYNAMIC_DEBUG_CLASSES definitions from just
the module.
- from dynamic_debug_init(), where all DYNAMIC_DEBUG_CLASSES
definitions of each builtin module have been packed together.
This is why ddebug_attach_module_classes() checks module-name.
NOTES
Its (highly) likely that builtin classes will be ordered by module
name (just like prdbg descriptors are in the __dyndbg section). So
the list can be replaced by a vector (ptr + length), which will work
for loaded modules too. This would imitate whats currently done for
the _ddebug descriptors.
That said, converting to vector,len is close to pointless; a small
minority of modules will ever define a class-map, and almost all of
them will have only 1 or 2 class-maps, so theres only a couple dozen
pointers to save. TODO: re-evaluate for lines removable.
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Link: https://lore.kernel.org/r/20220904214134.408619-17-jim.cromie@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add __dyndbg_classes section, using __dyndbg as a model. Use it:
vmlinux.lds.h:
KEEP the new section, which also silences orphan section warning on
loadable modules. Add (__start_/__stop_)__dyndbg_classes linker
symbols for the c externs (below).
kernel/module/main.c:
- fill new fields in find_module_sections(), using section_objs()
- extend callchain prototypes
to pass classes, length
load_module(): pass new info to dynamic_debug_setup()
dynamic_debug_setup(): new params, pass through to ddebug_add_module()
dynamic_debug.c:
- add externs to the linker symbols.
ddebug_add_module():
- It currently builds a debug_table, and *will* find and attach classes.
dynamic_debug_init():
- add class fields to the _ddebug_info cursor var: di.
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Link: https://lore.kernel.org/r/20220904214134.408619-16-jim.cromie@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This new struct composes the linker provided (vector,len) section,
and provides a place to add other __dyndbg[] state-data later:
descs - the vector of descriptors in __dyndbg section.
num_descs - length of the data/section.
Use it, in several different ways, as follows:
In lib/dynamic_debug.c:
ddebug_add_module(): Alter params-list, replacing 2 args (array,index)
with a struct _ddebug_info * containing them both, with room for
expansion. This helps future-proof the function prototype against the
looming addition of class-map info into the dyndbg-state, by providing
a place to add more member fields later.
NB: later add static struct _ddebug_info builtins_state declaration,
not needed yet.
ddebug_add_module() is called in 2 contexts:
In dynamic_debug_init(), declare, init a struct _ddebug_info di
auto-var to use as a cursor. Then iterate over the prdbg blocks of
the builtin modules, and update the di cursor before calling
_add_module for each.
Its called from kernel/module/main.c:load_info() for each loaded
module:
In internal.h, alter struct load_info, replacing the dyndbg array,len
fields with an embedded _ddebug_info containing them both; and
populate its members in find_module_sections().
The 2 calling contexts differ in that _init deals with contiguous
subranges of __dyndbgs[] section, packed together, while loadable
modules are added one at a time.
So rename ddebug_add_module() into outer/__inner fns, call __inner
from _init, and provide the offset into the builtin __dyndbgs[] where
the module's prdbgs reside. The cursor provides start, len of the
subrange for each. The offset will be used later to pack the results
of builtin __dyndbg_sites[] de-duplication, and is 0 and unneeded for
loadable modules,
Note:
kernel/module/main.c includes <dynamic_debug.h> for struct
_ddeubg_info. This might be prone to include loops, since its also
included by printk.h. Nothing has broken in robot-land on this.
cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Link: https://lore.kernel.org/r/20220904214134.408619-12-jim.cromie@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
rework var-names for clarity, regularity
rename variables
- n to mod_sites - it counts sites-per-module
- entries to i - display only
- iter_start to iter_mod_start - marks start of each module's subrange
- modct to mod_ct - stylistic
new iterator var:
- site - cursor parallel to iter
1st step towards 'demotion' of iter->site, for removal later
treat vars as iters:
- drop init at top
init just above for-loop, in a textual block
Acked-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Link: https://lore.kernel.org/r/20220904214134.408619-11-jim.cromie@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This exported fn is unused, and will not be needed. Lets dump it.
The export was added to let drm control pr_debugs, as part of using
them to avoid drm_debug_enabled overheads. But its better to just
implement the drm.debug bitmap interface, then its available for
everyone.
Fixes: a2d375eda7 ("dyndbg: refine export, rename to dynamic_debug_exec_queries()")
Fixes: 4c0d77828d ("dyndbg: export ddebug_exec_queries")
Acked-by: Jason Baron <jbaron@akamai.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Link: https://lore.kernel.org/r/20220904214134.408619-10-jim.cromie@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
dyndbg's control-parser: ddebug_parse_query(), requires that search
terms: module, func, file, lineno, are used only once in a query; a
thing cannot be named both foo and bar.
The cited commit added an overriding module modname, taken from the
module loader, which is authoritative. So it set query.module 1st,
which disallowed its use in the query-string.
But now, its useful to allow a module-load to enable classes across a
whole (or part of) a subsystem at once.
# enable (dynamic-debug in) drm only
modprobe drm dyndbg="class DRM_UT_CORE +p"
# get drm_helper too
modprobe drm dyndbg="class DRM_UT_CORE module drm* +p"
# get everything that knows DRM_UT_CORE
modprobe drm dyndbg="class DRM_UT_CORE module * +p"
# also for boot-args:
drm.dyndbg="class DRM_UT_CORE module * +p"
So convert the override into a default, by filling it only when/after
the query-string omitted the module.
NB: the query class FOO handling is forthcoming.
Fixes: 8e59b5cfb9 dynamic_debug: add modname arg to exec_query callchain
Acked-by: Jason Baron <jbaron@akamai.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Link: https://lore.kernel.org/r/20220904214134.408619-8-jim.cromie@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Walk the module's vector of callsites backwards; ie N..0. This
"corrects" the backwards appearance of a module's prdbg vector when
walked 0..N. I think this is due to linker mechanics, which I'm
inclined to treat as immutable, and the order is fixable in display.
No functional changes.
Combined with previous commit, which reversed tables-list, we get:
:#> head -n7 /proc/dynamic_debug/control
# filename:lineno [module]function flags format
init/main.c:1179 [main]initcall_blacklist =_ "blacklisting initcall %s\012"
init/main.c:1218 [main]initcall_blacklisted =_ "initcall %s blacklisted\012"
init/main.c:1424 [main]run_init_process =_ " with arguments:\012"
init/main.c:1426 [main]run_init_process =_ " %s\012"
init/main.c:1427 [main]run_init_process =_ " with environment:\012"
init/main.c:1429 [main]run_init_process =_ " %s\012"
Acked-by: Jason Baron <jbaron@akamai.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Link: https://lore.kernel.org/r/20220904214134.408619-6-jim.cromie@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
/proc/dynamic_debug/control walks the prdbg catalog in "reverse",
fix this by adding new ddebug_tables to tail of list.
This puts init/main.c entries 1st, which looks more than coincidental.
no functional changes.
Acked-by: Jason Baron <jbaron@akamai.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Link: https://lore.kernel.org/r/20220904214134.408619-5-jim.cromie@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
print "old => new" flag values to the info("change") message.
no functional change.
Acked-by: Jason Baron <jbaron@akamai.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Link: https://lore.kernel.org/r/20220904214134.408619-4-jim.cromie@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In https://lore.kernel.org/lkml/20211209150910.GA23668@axis.com/
Vincent's patch commented on, and worked around, a bug toggling
static_branch's, when a 2nd PRINTK-ish flag was added. The bug
results in a premature static_branch_disable when the 1st of 2 flags
was disabled.
The cited commit computed newflags, but then in the JUMP_LABEL block,
failed to use that result, instead using just one of the terms in it.
Using newflags instead made the code work properly.
This is Vincents test-case, reduced. It needs the 2nd flag to
demonstrate the bug, but it's explanatory here.
pt_test() {
echo 5 > /sys/module/dynamic_debug/verbose
site="module tcp" # just one callsite
echo " $site =_ " > /proc/dynamic_debug/control # clear it
# A B ~A ~B
for flg in +T +p "-T #broke here" -p; do
echo " $site $flg " > /proc/dynamic_debug/control
done;
# A B ~B ~A
for flg in +T +p "-p #broke here" -T; do
echo " $site $flg " > /proc/dynamic_debug/control
done
}
pt_test
Fixes: 84da83a6ff dyndbg: combine flags & mask into a struct, simplify with it
CC: vincent.whitchurch@axis.com
Acked-by: Jason Baron <jbaron@akamai.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Jim Cromie <jim.cromie@gmail.com>
Link: https://lore.kernel.org/r/20220904214134.408619-2-jim.cromie@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
netlink allows to specify allowed ranges for integer types.
Unfortunately, nfnetlink passes integers in big endian, so the existing
NLA_POLICY_MAX() cannot be used.
At the moment, nfnetlink users, such as nf_tables, need to resort to
programmatic checking via helpers such as nft_parse_u32_check().
This is both cumbersome and error prone. This adds NLA_POLICY_MAX_BE
which adds range check support for BE16, BE32 and BE64 integers.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new helper function to allow for splitting specified user string
into a sequence of integers. Internally it makes use of get_options() so
the returned sequence contains the integers extracted plus an additional
element that begins the sequence and specifies the integers count.
Suggested-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Cezary Rojewski <cezary.rojewski@intel.com>
Link: https://lore.kernel.org/r/20220904102840.862395-2-cezary.rojewski@intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>
This reverts commit 16ede66973.
This is causing issues with CPU stalls on my test box, revert it for
now until we understand what is going on. It looks like infinite
looping off sbitmap_queue_wake_up(), but hard to tell with a lot of
CPUs hitting this issue and the console scrolling infinitely.
Link: https://lore.kernel.org/linux-block/e742813b-ce5c-0d58-205b-1626f639b1bd@kernel.dk/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
devm_ioremap_np has never been used anywhere since it was added in early
2021, so remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220822061424.151819-1-hch@lst.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
console_unblank() does this too (called in both places right after),
and with a lot more confidence inspiring approach to locking.
Reconstructing this story is very strange:
In b61312d353 ("oops handling: ensure that any oops is flushed to
the mtdoops console") it is claimed that a printk(" "); flushed out
the console buffer, which was removed in e3e8a75d2a ("[PATCH]
Extract and use wake_up_klogd()"). In todays kernels this is done way
earlier in console_flush_on_panic with some really nasty tricks. I
didn't bother to fully reconstruct this all, least because the call to
bust_spinlock(0); gets moved every few years, depending upon how the
wind blows (or well, who screamed loudest about the various issue each
call site caused).
Before that commit the only calls to console_unblank() where in s390
arch code.
The other side here is the console->unblank callback, which was
introduced in 2.1.31 for the vt driver. Which predates the
console_unblank() function by a lot, which was added (without users)
in 2.4.14.3. So pretty much impossible to guess at any motivation
here. Also afaict the vt driver is the only (and always was the only)
console driver implementing the unblank callback, so no idea why a
call to console_unblank() was added for the mtdooops driver - the
action actually flushing out the console buffers is done from
console_unlock() only.
Note that as prep for the s390 users the locking was adjusted in
2.5.22 (I couldn't figure out how to properly reference the BK commit
from the historical git trees) from a normal semaphore to a trylock.
Note that a copy of the direct unblank_screen() call was added to
panic() in c7c3f05e34 ("panic: avoid deadlocks in re-entrant console
drivers"), which partially inlined the bust_spinlocks(0); call.
Long story short, I have no idea why the direct call to unblank_screen
survived for so long (the infrastructure to do it properly existed for
years), nor why it wasn't removed when the console_unblank() call was
finally added. But it makes a ton more sense to finally do that than
not - it's just better encapsulation to go through the console
functions instead of doing a direct call, so let's dare. Plus it
really does not make much sense to call the only unblank
implementation there is twice, once without, and once with appropriate
locking.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Ilpo Järvinen" <ilpo.jarvinen@linux.intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Xuezhi Zhang <zhangxuezhi1@coolpad.com>
Cc: Yangxi Xiang <xyangxi5@gmail.com>
Cc: nick black <dankamongmen@gmail.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
Cc: Marco Elver <elver@google.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: David Gow <davidgow@google.com>
Cc: tangmeng <tangmeng@uniontech.com>
Cc: Tiezhu Yang <yangtiezhu@loongson.cn>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/r/20220830145004.430545-1-daniel.vetter@ffwll.ch
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
dependency on XOR_BLOCKS.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmMIp9QACgkQxycdCkmx
i6eYhRAAs4XrSzpSSWaTaf3JZuQIqqs9KoABLvxeYfemEwgnd28NLQrea2qnM5yW
NT5TSL/OYpTmTTD9Q/uHm9TBpRchuQMto7rDLwB129ie1WdeSqwOKjk9zMb7SzkL
5mgO2L05jkyvo4NI4KxUyeBtMNdZc8eu4/iAhOejdnZYIgxFQUAuNVmFcy1VG/tq
kFOOLaXLOtrs9y+IswyoLoP7LdLbYCtUO8B4wBzMKZmRHXRKYtpXAnz/ytUhfSEM
JL6Vrb4jwAUK29B8A6Nk49gl5CHSlXOhnTrv6RK2qhwpLXQxsWWrDRinsJ6xQJk0
3pWVRDm2MAhpqsuk6Q/Cj0pFpFSiLCPRPSiH0Du2w6W3Fm8KGcQoE87bYIgBSTX/
vpBdGU0ItVokMB0OqNIHOLUFyi4wMM20wOKt5znVLeVlY0adye6SdcBGm/qXurR4
midv7HNyBZkjh/H9wtC//hdkpQDq8l87ygL/F44IBFlXItUFt+Vrhoi61m8AGSVb
5nkGwSQMe/VXBKV3gpBEM7LzVTwUsw0poG73rxFBByrAzVVd9Y9F4FXJ5xUmcRvf
uYl95Zw6vk5IBu/ijdqaJPyHBXLqVWg7FyAlZjtJxCIO4gIBMl/KCNY8yHE6f8Nd
6kpvL2GmFvYmpnu04ay+tFAEB0KgbzDZSNrZissZN0ilMwvykzM=
=dIEI
-----END PGP SIGNATURE-----
Merge tag 'v6.0-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
"Fix a boot performance regression due to an unnecessary dependency on
XOR_BLOCKS"
* tag 'v6.0-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: lib - remove unneeded selection of XOR_BLOCKS
Add KUnit test for hw_breakpoint constraints accounting, with various
interesting mixes of breakpoint targets (some care was taken to catch
interesting corner cases via bug-injection).
The test cannot be built as a module because it requires access to
hw_breakpoint_slots(), which is not inlinable or exported on all
architectures.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20220829124719.675715-2-elver@google.com
Hi Linus,
Please pull (hopefully) the last portion of fixes from Sander for his
UP rework series. The original series came from -mm tree, and it was
not the latest version, that's why we need follow-ups. It fixes only
a test introduced by that series. The test fails under certain configs.
From Sander:
This series fixes the reported issues, and implements the suggested
improvements, for the version of the cpumask tests [1] that was merged
with commit c41e8866c2 ("lib/test: introduce cpumask KUnit test
suite").
These changes include fixes for the tests, and better alignment with the
KUnit style guidelines.
Thanks,
Yury
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEEi8GdvG6xMhdgpu/4sUSA/TofvsgFAmMKYcoACgkQsUSA/Tof
vsiceAv/TY+HTn1gmrNQwi7xC6VUD7mFYlVNZMtyMpZ23UYildz5SjFfuQV3UbXI
H5yKgSao9VFsbwyDUXbhySgOaNR8auq17Ey3jSJuR2A76qO2u2d79Gdt4IjIkq5N
IGOPv/pNOur7J+KSbiVhXasFeZGJ6Xi+xAobp5CK1uPCUI3oU1pAcm1iKkI+eWZ3
tPsM3aWcYGCDec7tqtqcsiWO2x9imPnrpI+C91Pwwr+N40ObkMc4IPzuPrQRn2T2
ECY9pgIWKOwOJ41jzgCVwZIHmuOn9dEgmaEGvE9Ah57OwuDlS43M4Ok3xy2+xS3t
3naLG3p02sJy7sXabC+xH4VJVPNT9/qauMW27cntPeeI2i/+yZXuQSLlVOllrY7/
LYxI8lVb1j50A90I/WrwXoDV0E68cfjhkiqhkgV33t1EamhSJvTG8GwCnF46WG8o
LzLukvoohA9uIrPAH2YpkZtrvsuT6iQccCY0M+kXv6TuYTgygdE16muVHffDKvsG
EIVdBGu6
=oNmV
-----END PGP SIGNATURE-----
Merge tag 'bitmap-6.0-rc3' of github.com:/norov/linux
Pull bitmap fixes from Yury Norov:
"Fix the reported issues, and implements the suggested improvements,
for the version of the cpumask tests [1] that was merged with commit
c41e8866c2 ("lib/test: introduce cpumask KUnit test suite").
These changes include fixes for the tests, and better alignment with
the KUnit style guidelines"
* tag 'bitmap-6.0-rc3' of github.com:/norov/linux:
lib/cpumask_kunit: add tests file to MAINTAINERS
lib/cpumask_kunit: log mask contents
lib/test_cpumask: follow KUnit style guidelines
lib/test_cpumask: fix cpu_possible_mask last test
lib/test_cpumask: drop cpu_possible_mask full test
CRYPTO_LIB_CHACHA_GENERIC doesn't need to select XOR_BLOCKS. It perhaps
was thought that it's needed for __crypto_xor, but that's not the case.
Enabling XOR_BLOCKS is problematic because the XOR_BLOCKS code runs a
benchmark when it is initialized. That causes a boot time regression on
systems that didn't have it enabled before.
Therefore, remove this unnecessary and problematic selection.
Fixes: e56e189855 ("lib/crypto: add prompts back to crypto libraries")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Current release - new code bugs:
- dsa: don't dereference NULL extack in dsa_slave_changeupper()
- dpaa: fix <1G ethernet on LS1046ARDB
- neigh: don't call kfree_skb() under spin_lock_irqsave()
Previous releases - regressions:
- r8152: fix the RX FIFO settings when suspending
- dsa: microchip: keep compatibility with device tree blobs with
no phy-mode
- Revert "net: macsec: update SCI upon MAC address change."
- Revert "xfrm: update SA curlft.use_time", comply with RFC 2367
Previous releases - always broken:
- netfilter: conntrack: work around exceeded TCP receive window
- ipsec: fix a null pointer dereference of dst->dev on a metadata
dst in xfrm_lookup_with_ifid
- moxa: get rid of asymmetry in DMA mapping/unmapping
- dsa: microchip: make learning configurable and keep it off
while standalone
- ice: xsk: prohibit usage of non-balanced queue id
- rxrpc: fix locking in rxrpc's sendmsg
Misc:
- another chunk of sysctl data race silencing
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmMH1scACgkQMUZtbf5S
IrtzTA//as5jbKepxBLqWjmDtTXTzkR9AZwD3pz/y2eRYYZz97N5R6TYLXh03zc0
OoB7yNIsjOtYu0aB0KosF+mqeGSzIG8MZ5W6eecQVRhUL270OD/kJ0G89CeHyuKP
BYUQE2S8z+55qM6IQ0DKbR4F038J2OeR6HdV7VUDFYRGfxDZsTZU4q3aY5bklAuz
TvpDAEsw0818a2lTdgqFUeRwbcU8ZIAJhiE/LQmqxhjsGyPkK02907Ccn06IrcAy
UHRBc6Cbjn8IcNNSL0hChjAkUdHtk7iHAqU8Nr2QnxKbE0FHGVOW8BsmY5GYvLAC
hH7t/dJAu3WUxubImZG6rnp3YD3YNZoaJrDgg6jSCJeUL6MKO2rJf8Q5HGiTJOWH
8vyPfCrB9IQVnef6Im0u9EFTyu9+W4MGVN4hyhttv2OykZwSQfdpjceGZgELiwSC
+od2p8TSXkZix//cTdWeO5THSnpHeMudh+0DEm10Uzf4+ybqIVuPn2ZCSy6piYJX
nsAIac1j7onWEyKQQ/nqy0o6rlZwLe+h0BraHHp3sApWVjyFwS4p6Z6VADed4kga
n/BsINdIW56pBT2nSrBTG5/RirlVfUTOaqiry0t6oak2qooEs0Gmm8DEbgTkncbs
BRLZTVzn6X3XWq52SXf7/v36xEJ/LRooY7MqUEMPg4emgGoNuC4=
=azH5
-----END PGP SIGNATURE-----
Merge tag 'net-6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from ipsec and netfilter (with one broken Fixes tag).
Current release - new code bugs:
- dsa: don't dereference NULL extack in dsa_slave_changeupper()
- dpaa: fix <1G ethernet on LS1046ARDB
- neigh: don't call kfree_skb() under spin_lock_irqsave()
Previous releases - regressions:
- r8152: fix the RX FIFO settings when suspending
- dsa: microchip: keep compatibility with device tree blobs with no
phy-mode
- Revert "net: macsec: update SCI upon MAC address change."
- Revert "xfrm: update SA curlft.use_time", comply with RFC 2367
Previous releases - always broken:
- netfilter: conntrack: work around exceeded TCP receive window
- ipsec: fix a null pointer dereference of dst->dev on a metadata dst
in xfrm_lookup_with_ifid
- moxa: get rid of asymmetry in DMA mapping/unmapping
- dsa: microchip: make learning configurable and keep it off while
standalone
- ice: xsk: prohibit usage of non-balanced queue id
- rxrpc: fix locking in rxrpc's sendmsg
Misc:
- another chunk of sysctl data race silencing"
* tag 'net-6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (87 commits)
net: lantiq_xrx200: restore buffer if memory allocation failed
net: lantiq_xrx200: fix lock under memory pressure
net: lantiq_xrx200: confirm skb is allocated before using
net: stmmac: work around sporadic tx issue on link-up
ionic: VF initial random MAC address if no assigned mac
ionic: fix up issues with handling EAGAIN on FW cmds
ionic: clear broken state on generation change
rxrpc: Fix locking in rxrpc's sendmsg
net: ethernet: mtk_eth_soc: fix hw hash reporting for MTK_NETSYS_V2
MAINTAINERS: rectify file entry in BONDING DRIVER
i40e: Fix incorrect address type for IPv6 flow rules
ixgbe: stop resetting SYSTIME in ixgbe_ptp_start_cyclecounter
net: Fix a data-race around sysctl_somaxconn.
net: Fix a data-race around netdev_unregister_timeout_secs.
net: Fix a data-race around gro_normal_batch.
net: Fix data-races around sysctl_devconf_inherit_init_net.
net: Fix data-races around sysctl_fb_tunnels_only_for_init_net.
net: Fix a data-race around netdev_budget_usecs.
net: Fix data-races around sysctl_max_skb_frags.
net: Fix a data-race around netdev_budget.
...
There is no modification for param bitmap in function
bitmap_string() and bitmap_list_string(), so add const
modifier for it.
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220816144557.30779-1-huangguangbin2@huawei.com
The cpumask test suite doesn't follow the KUnit style guidelines, as
laid out in Documentation/dev-tools/kunit/style.rst. The file is
renamed to lib/cpumask_kunit.c to clearly distinguish it from other,
non-KUnit, tests.
Link: https://lore.kernel.org/lkml/346cb279-8e75-24b0-7d12-9803f2b41c73@riseup.net/
Suggested-by: Maíra Canal <mairacanal@riseup.net>
Signed-off-by: Sander Vanheule <sander@svanheule.net>
Reviewed-by: Maíra Canal <mairacanal@riseup.net>
Reviewed-by: David Gow <davidgow@google.com>
Acked-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Since cpumask_first() on the cpu_possible_mask must return at most
nr_cpu_ids - 1 for a valid result, cpumask_last() cannot return anything
larger than this value. As test_cpumask_weight() also verifies that the
total weight of cpu_possible_mask must equal nr_cpu_ids, the last bit
set in this mask must be at nr_cpu_ids - 1.
Fixes: c41e8866c2 ("lib/test: introduce cpumask KUnit test suite")
Link: https://lore.kernel.org/lkml/346cb279-8e75-24b0-7d12-9803f2b41c73@riseup.net/
Reported-by: Maíra Canal <mairacanal@riseup.net>
Signed-off-by: Sander Vanheule <sander@svanheule.net>
Tested-by: Maíra Canal <mairacanal@riseup.net>
Reviewed-by: David Gow <davidgow@google.com>
Acked-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
When the number of CPUs that can possibly be brought online is known at
boot time, e.g. when HOTPLUG is disabled, nr_cpu_ids may be smaller than
NR_CPUS. In that case, cpu_possible_mask would not be completely filled,
and cpumask_full(cpu_possible_mask) can return false for valid system
configurations.
Without this test, cpu_possible_mask contents are still constrained by
a check on cpumask_weight(), as well as tests in test_cpumask_first(),
test_cpumask_last(), test_cpumask_next(), and test_cpumask_iterators().
Fixes: c41e8866c2 ("lib/test: introduce cpumask KUnit test suite")
Link: https://lore.kernel.org/lkml/346cb279-8e75-24b0-7d12-9803f2b41c73@riseup.net/
Reported-by: Maíra Canal <mairacanal@riseup.net>
Signed-off-by: Sander Vanheule <sander@svanheule.net>
Tested-by: Maíra Canal <mairacanal@riseup.net>
Reviewed-by: David Gow <davidgow@google.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
While reading rs->interval and rs->burst, they can be changed
concurrently via sysctl (e.g. net_ratelimit_state). Thus, we
need to add READ_ONCE() to their readers.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two problems can lead to lost wakeup:
1) invalid wakeup on the wrong waitqueue:
For example, 2 * wake_batch tags are put, while only wake_batch threads
are woken:
__sbq_wake_up
atomic_cmpxchg -> reset wait_cnt
__sbq_wake_up -> decrease wait_cnt
...
__sbq_wake_up -> wait_cnt is decreased to 0 again
atomic_cmpxchg
sbq_index_atomic_inc -> increase wake_index
wake_up_nr -> wake up and waitqueue might be empty
sbq_index_atomic_inc -> increase again, one waitqueue is skipped
wake_up_nr -> invalid wake up because old wakequeue might be empty
To fix the problem, increasing 'wake_index' before resetting 'wait_cnt'.
2) 'wait_cnt' can be decreased while waitqueue is empty
As pointed out by Jan Kara, following race is possible:
CPU1 CPU2
__sbq_wake_up __sbq_wake_up
sbq_wake_ptr() sbq_wake_ptr() -> the same
wait_cnt = atomic_dec_return()
/* decreased to 0 */
sbq_index_atomic_inc()
/* move to next waitqueue */
atomic_set()
/* reset wait_cnt */
wake_up_nr()
/* wake up on the old waitqueue */
wait_cnt = atomic_dec_return()
/*
* decrease wait_cnt in the old
* waitqueue, while it can be
* empty.
*/
Fix the problem by waking up before updating 'wake_index' and
'wait_cnt'.
With this patch, noted that 'wait_cnt' is still decreased in the old
empty waitqueue, however, the wakeup is redirected to a active waitqueue,
and the extra decrement on the old empty waitqueue is not handled.
Fixes: 88459642cb ("blk-mq: abstract tag allocation out into sbitmap library")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220803121504.212071-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No architecture actually defines this, so it's unneeded.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
CRYPTO_LIB_CHACHA depends on CRYPTO for __crypto_xor, defined in
crypto/algapi.c. This is a layering violation because the dependencies
should only go in the other direction (crypto/ => lib/crypto/). Also
the correct dependency would be CRYPTO_ALGAPI, not CRYPTO. Fix this by
moving __crypto_xor into the utils module in lib/crypto/.
Note that CRYPTO_LIB_CHACHA_GENERIC selected XOR_BLOCKS, which is
unrelated and unnecessary. It was perhaps thought that XOR_BLOCKS was
needed for __crypto_xor, but that's not the case.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
As requested at
https://lore.kernel.org/r/YtEgzHuuMts0YBCz@gondor.apana.org.au, move
__crypto_memneq into lib/crypto/ and put it under a new tristate. The
tristate is CRYPTO_LIB_UTILS, and it builds a module libcryptoutils. As
more crypto library utilities are being added, this creates a single
place for them to go without cluttering up the main lib directory.
The module's main file will be lib/crypto/utils.c. However, leave
memneq.c as its own file because of its nonstandard license.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Since lib/cpumask.o is only built for CONFIG_SMP=y, NR_CPUS will always
be greater than 1 at compile time. This makes checking for that
condition unnecesarry, so it can be dropped.
Signed-off-by: Sander Vanheule <sander@svanheule.net>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
In the uniprocessor case, cpumask_next_wrap() can be simplified, as the
number of valid argument combinations is limited:
- 'start' can only be 0
- 'n' can only be -1 or 0
The only valid CPU that can then be returned, if any, will be the first
one set in the provided 'mask'.
For NR_CPUS == 1, include/linux/cpumask.h now provides an inline
definition of cpumask_next_wrap(), which will conflict with the one
provided by lib/cpumask.c. Make building of lib/cpumask.o again depend
on CONFIG_SMP=y (i.e. NR_CPUS > 1) to avoid the re-definition.
Suggested-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Sander Vanheule <sander@svanheule.net>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Commit 36d4b36b69 ("lib/nodemask: inline next_node_in() and
node_random()") removed the lib/nodemask.c file, but the remove didn't
happen when the patch was applied.
Reported-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* more new_sync_{read,write}() speedups - ITER_UBUF introduction
* ITER_PIPE cleanups
* unification of iov_iter_get_pages/iov_iter_get_pages_alloc and
switching them to advancing semantics
* making ITER_PIPE take high-order pages without splitting them
* handling copy_page_from_iter() for high-order pages properly
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYvHI8QAKCRBZ7Krx/gZQ
62CQAPsGlbebqBeAT2pMulaGDxfLAsgz5Yf4BEaMLhPtRqFOQgD+KrZQId7Sd8O0
3IWucpTb2c4jvLlXhGMS+XWnusQH+AQ=
=pBux
-----END PGP SIGNATURE-----
Merge tag 'pull-work.iov_iter-rebased' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more iov_iter updates from Al Viro:
- more new_sync_{read,write}() speedups - ITER_UBUF introduction
- ITER_PIPE cleanups
- unification of iov_iter_get_pages/iov_iter_get_pages_alloc and
switching them to advancing semantics
- making ITER_PIPE take high-order pages without splitting them
- handling copy_page_from_iter() for high-order pages properly
* tag 'pull-work.iov_iter-rebased' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (32 commits)
fix copy_page_from_iter() for compound destinations
hugetlbfs: copy_page_to_iter() can deal with compound pages
copy_page_to_iter(): don't split high-order page in case of ITER_PIPE
expand those iov_iter_advance()...
pipe_get_pages(): switch to append_pipe()
get rid of non-advancing variants
ceph: switch the last caller of iov_iter_get_pages_alloc()
9p: convert to advancing variant of iov_iter_get_pages_alloc()
af_alg_make_sg(): switch to advancing variant of iov_iter_get_pages()
iter_to_pipe(): switch to advancing variant of iov_iter_get_pages()
block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()
iov_iter: saner helper for page array allocation
fold __pipe_get_pages() into pipe_get_pages()
ITER_XARRAY: don't open-code DIV_ROUND_UP()
unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts
unify xarray_get_pages() and xarray_get_pages_alloc()
unify pipe_get_pages() and pipe_get_pages_alloc()
iov_iter_get_pages(): sanity-check arguments
iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper
...
now that we are advancing the iterator, there's no need to
treat the first page separately - just call append_pipe()
in a loop.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>