When reading unformatted tracks on ESE devices, the corresponding memory
areas are simply set to zero for each segment. This is done incorrectly
for blocksizes < 4096.
There are two problems. First, the increment of dst is done using the
counter of the loop (off), which is increased by blksize every
iteration. This leads to a much bigger increment for dst as actually
intended. Second, the increment of dst is done before the memory area
is set to 0, skipping a significant amount of bytes of memory.
This leads to illegal overwriting of memory and ultimately to a kernel
panic.
This is not a problem with 4k blocksize because
blk_queue_max_segment_size is set to PAGE_SIZE, always resulting in a
single iteration for the inner segment loop (bv.bv_len == blksize). The
incorrectly used 'off' value to increment dst is 0 and the correct
memory area is used.
In order to fix this for blksize < 4k, increment dst correctly using the
blksize and only do it at the end of the loop.
Fixes: 5e2b17e712 ("s390/dasd: Add dynamic formatting support for ESE volumes")
Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://lore.kernel.org/r/20220505141733.1989450-4-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For ESE devices we get an error for write operations on an unformatted
track. Afterwards the track will be formatted and the IO operation
restarted.
When using alias devices a track might be accessed by multiple requests
simultaneously and there is a race window that a track gets formatted
twice resulting in data loss.
Prevent this by remembering the amount of formatted tracks when starting
a request and comparing this number before actually formatting a track
on the fly. If the number has changed there is a chance that the current
track was finally formatted in between. As a result do not format the
track and restart the current IO to check.
The number of formatted tracks does not match the overall number of
formatted tracks on the device and it might wrap around but this is no
problem. It is only needed to recognize that a track has been formatted at
all in between.
Fixes: 5e2b17e712 ("s390/dasd: Add dynamic formatting support for ESE volumes")
Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20220505141733.1989450-3-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For ESE devices we get an error when accessing an unformatted track.
The handling of this error will return zero data for read requests and
format the track on demand before writing to it. To do this the code needs
to distinguish between read and write requests. This is done with data from
the blocklayer request. A pointer to the blocklayer request is stored in
the CQR.
If there is an error on the device an ERP request is built to do error
recovery. While the ERP request is mostly a copy of the original CQR the
pointer to the blocklayer request is not copied to not accidentally pass
it back to the blocklayer without cleanup.
This leads to the error that during ESE handling after an ERP request was
built it is not possible to determine the IO direction. This leads to the
formatting of a track for read requests which might in turn lead to data
corruption.
Fixes: 5e2b17e712 ("s390/dasd: Add dynamic formatting support for ESE volumes")
Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Link: https://lore.kernel.org/r/20220505141733.1989450-2-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Fix a race when we were calling folio_next() in the BIO folio iter
without holding a reference, meaning the folio could be split or freed,
and we'd jump to the next page instead of the intended next folio.
- Fix readahead creating single-page folios instead of the intended
large folios when doing reads that are not a power of two in size.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmJ0Xu4ACgkQDpNsjXcp
gj4rTAf/Rp2P9jwnOCN9X78YBiydkHq9dtIYbEz1jhOr2pnbz/ZWOeWvVvTBgG5I
GSIeaK3dhCBqi6G28QrQR1j1+gOWOJOs/rmJtkkOgBfoGsCL8HLFzcbXR10zeF2K
8bhivsq5tshn2DiVu8WK1W2n25mg4k7ORrBVcuUtW4Am8EPsyJpzoSWBTlZJvClt
Re9mIkbWNWktEyRiMl8wA4WRKqysaIWBuf9jugaOrv0Y0Db2TqiqYiAG6xm3VSZy
ABf8ZSOyNuxF6ZrW2tUjwdnJ6oDXjVB3Dykw4EQMFQ6uINJPBArj8AkDUe4FJa2w
9FmDLDxR1T4k9+8cEC6ZkkVb6KyvdQ==
=KKsJ
-----END PGP SIGNATURE-----
Merge tag 'folio-5.18f' of git://git.infradead.org/users/willy/pagecache
Pull folio fixes from Matthew Wilcox:
"Two folio fixes for 5.18.
Darrick and Brian have done amazing work debugging the race I created
in the folio BIO iterator. The readahead problem was deterministic, so
easy to fix.
- Fix a race when we were calling folio_next() in the BIO folio iter
without holding a reference, meaning the folio could be split or
freed, and we'd jump to the next page instead of the intended next
folio.
- Fix readahead creating single-page folios instead of the intended
large folios when doing reads that are not a power of two in size"
* tag 'folio-5.18f' of git://git.infradead.org/users/willy/pagecache:
mm/readahead: Fix readahead with large folios
block: Do not call folio_next() on an unreferenced folio
- Drop unused 'max-link-speed' in Apple PCIe
- More redundant 'maxItems/minItems' schema fixes
- Support values for pinctrl 'drive-push-pull' and 'drive-open-drain'
- Fix redundant 'unevaluatedProperties' in MT6360 LEDs binding
- Add missing 'power-domains' property to Cadence UFSHC
-----BEGIN PGP SIGNATURE-----
iQJEBAABCgAuFiEEktVUI4SxYhzZyEuo+vtdtY28YcMFAmJz6UAQHHJvYmhAa2Vy
bmVsLm9yZwAKCRD6+121jbxhw1wzEACPQm9LVVSFxeevPMPfWEO3CqYdVYaLt2w9
AnQHng8OkUJkN5lEdhQdyrYpJpvYlqVJVx7FeXr839nBi+qzdCszeWEu3WH6+Tw3
HNp43QarhRAi5XyV87jBFQeuFONBBHEvOKuoqE8jQ/aFZ6hEOdrSQ9JVpcC5LPOK
dvokPcHaKQElWVa1oG4hqJjoEHvTQNo391+L/PmCxLBlIIWkSX/BtEnXKioes0nm
EXoYoMNQHsH3RtZThGb+NpafOgLE8oRODTJx+Q1zUDw9MEJHQCXEBv4Y+ene1pkg
KkMaPGXNQbI6XQaVSBXHXHkx1vh6BE400MlHpOpeXau+psjYzEF96ePpe2GYq3GY
TRFGvIj6W/We5VLdVlJ94CIdZAs1qSYQ0qIgzbg5pj7SK7KNLojkZiozg+qa5xXL
5Z1VQGmaTMqLukUmLjFpF63Zdxm2AmgifbQTO1eZTFC9xPB3mKoIuIXojr/5g84u
Jc05xNz/mbCnOwN3WNYLqc4bVS2bDhutfpDJOnn47RHI3vBhYiY1Xohe+Vm+Xu5s
6hKKxTn/2QW0dubn9/5BLfvNuH9/3jg0Pw0f/orJcBdI1OuLKCuNhAO4PTjFOAAZ
6wy5W5GOGwE7HgJGdb9BNQY0FHmaxL9lMWQrUuJ6yF7ghNKI2gO1MFWnzjYbsMaC
V91pihSiCg==
=fpGC
-----END PGP SIGNATURE-----
Merge tag 'devicetree-fixes-for-5.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull devicetree fixes from Rob Herring:
- Drop unused 'max-link-speed' in Apple PCIe
- More redundant 'maxItems/minItems' schema fixes
- Support values for pinctrl 'drive-push-pull' and 'drive-open-drain'
- Fix redundant 'unevaluatedProperties' in MT6360 LEDs binding
- Add missing 'power-domains' property to Cadence UFSHC
* tag 'devicetree-fixes-for-5.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
dt-bindings: pci: apple,pcie: Drop max-link-speed from example
dt-bindings: Drop redundant 'maxItems/minItems' in if/then schemas
dt-bindings: pinctrl: Allow values for drive-push-pull and drive-open-drain
dt-bindings: leds-mt6360: Drop redundant 'unevaluatedProperties'
dt-bindings: ufs: cdns,ufshc: Add power-domains
The new state allowing device addition with paused balance is not
exported to user space so it can't recognize it and actually start the
operation.
Fixes: efc0e69c2f ("btrfs: introduce exclusive operation BALANCE_PAUSED state")
CC: stable@vger.kernel.org # 5.17
Signed-off-by: David Sterba <dsterba@suse.com>
When inserting a key range item (BTRFS_DIR_LOG_INDEX_KEY) while logging
a directory, we don't expect the insertion to fail with -EEXIST, because
we are holding the directory's log_mutex and we have dropped all existing
BTRFS_DIR_LOG_INDEX_KEY keys from the log tree before we started to log
the directory. However it's possible that during the logging we attempt
to insert the same BTRFS_DIR_LOG_INDEX_KEY key twice, but for this to
happen we need to race with insertions of items from other inodes in the
subvolume's tree while we are logging a directory. Here's how this can
happen:
1) We are logging a directory with inode number 1000 that has its items
spread across 3 leaves in the subvolume's tree:
leaf A - has index keys from the range 2 to 20 for example. The last
item in the leaf corresponds to a dir item for index number 20. All
these dir items were created in a past transaction.
leaf B - has index keys from the range 22 to 100 for example. It has
no keys from other inodes, all its keys are dir index keys for our
directory inode number 1000. Its first key is for the dir item with
a sequence number of 22. All these dir items were also created in a
past transaction.
leaf C - has index keys for our directory for the range 101 to 120 for
example. This leaf also has items from other inodes, and its first
item corresponds to the dir item for index number 101 for our directory
with inode number 1000;
2) When we finish processing the items from leaf A at log_dir_items(),
we log a BTRFS_DIR_LOG_INDEX_KEY key with an offset of 21 and a last
offset of 21, meaning the log is authoritative for the index range
from 21 to 21 (a single sequence number). At this point leaf B was
not yet modified in the current transaction;
3) When we return from log_dir_items() we have released our read lock on
leaf B, and have set *last_offset_ret to 21 (index number of the first
item on leaf B minus 1);
4) Some other task inserts an item for other inode (inode number 1001 for
example) into leaf C. That resulted in pushing some items from leaf C
into leaf B, in order to make room for the new item, so now leaf B
has dir index keys for the sequence number range from 22 to 102 and
leaf C has the dir items for the sequence number range 103 to 120;
5) At log_directory_changes() we call log_dir_items() again, passing it
a 'min_offset' / 'min_key' value of 22 (*last_offset_ret from step 3
plus 1, so 21 + 1). Then btrfs_search_forward() leaves us at slot 0
of leaf B, since leaf B was modified in the current transaction.
We have also initialized 'last_old_dentry_offset' to 20 after calling
btrfs_previous_item() at log_dir_items(), as it left us at the last
item of leaf A, which refers to the dir item with sequence number 20;
6) We then call process_dir_items_leaf() to process the dir items of
leaf B, and when we process the first item, corresponding to slot 0,
sequence number 22, we notice the dir item was created in a past
transaction and its sequence number is greater than the value of
*last_old_dentry_offset + 1 (20 + 1), so we decide to log again a
BTRFS_DIR_LOG_INDEX_KEY key with an offset of 21 and an end range
of 21 (key.offset - 1 == 22 - 1 == 21), which results in an -EEXIST
error from insert_dir_log_key(), as we have already inserted that
key at step 2, triggering the assertion at process_dir_items_leaf().
The trace produced in dmesg is like the following:
assertion failed: ret != -EEXIST, in fs/btrfs/tree-log.c:3857
[198255.980839][ T7460] ------------[ cut here ]------------
[198255.981666][ T7460] kernel BUG at fs/btrfs/ctree.h:3617!
[198255.983141][ T7460] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
[198255.984080][ T7460] CPU: 0 PID: 7460 Comm: repro-ghost-dir Not tainted 5.18.0-5314c78ac373-misc-next+
[198255.986027][ T7460] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[198255.988600][ T7460] RIP: 0010:assertfail.constprop.0+0x1c/0x1e
[198255.989465][ T7460] Code: 8b 4c 89 (...)
[198255.992599][ T7460] RSP: 0018:ffffc90007387188 EFLAGS: 00010282
[198255.993414][ T7460] RAX: 000000000000003d RBX: 0000000000000065 RCX: 0000000000000000
[198255.996056][ T7460] RDX: 0000000000000001 RSI: ffffffff8b62b180 RDI: fffff52000e70e24
[198255.997668][ T7460] RBP: ffffc90007387188 R08: 000000000000003d R09: ffff8881f0e16507
[198255.999199][ T7460] R10: ffffed103e1c2ca0 R11: 0000000000000001 R12: 00000000ffffffef
[198256.000683][ T7460] R13: ffff88813befc630 R14: ffff888116c16e70 R15: ffffc90007387358
[198256.007082][ T7460] FS: 00007fc7f7c24640(0000) GS:ffff8881f0c00000(0000) knlGS:0000000000000000
[198256.009939][ T7460] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[198256.014133][ T7460] CR2: 0000560bb16d0b78 CR3: 0000000140b34005 CR4: 0000000000170ef0
[198256.015239][ T7460] Call Trace:
[198256.015674][ T7460] <TASK>
[198256.016313][ T7460] log_dir_items.cold+0x16/0x2c
[198256.018858][ T7460] ? replay_one_extent+0xbf0/0xbf0
[198256.025932][ T7460] ? release_extent_buffer+0x1d2/0x270
[198256.029658][ T7460] ? rcu_read_lock_sched_held+0x16/0x80
[198256.031114][ T7460] ? lock_acquired+0xbe/0x660
[198256.032633][ T7460] ? rcu_read_lock_sched_held+0x16/0x80
[198256.034386][ T7460] ? lock_release+0xcf/0x8a0
[198256.036152][ T7460] log_directory_changes+0xf9/0x170
[198256.036993][ T7460] ? log_dir_items+0xba0/0xba0
[198256.037661][ T7460] ? do_raw_write_unlock+0x7d/0xe0
[198256.038680][ T7460] btrfs_log_inode+0x233b/0x26d0
[198256.041294][ T7460] ? log_directory_changes+0x170/0x170
[198256.042864][ T7460] ? btrfs_attach_transaction_barrier+0x60/0x60
[198256.045130][ T7460] ? rcu_read_lock_sched_held+0x16/0x80
[198256.046568][ T7460] ? lock_release+0xcf/0x8a0
[198256.047504][ T7460] ? lock_downgrade+0x420/0x420
[198256.048712][ T7460] ? ilookup5_nowait+0x81/0xa0
[198256.049747][ T7460] ? lock_downgrade+0x420/0x420
[198256.050652][ T7460] ? do_raw_spin_unlock+0xa9/0x100
[198256.051618][ T7460] ? __might_resched+0x128/0x1c0
[198256.052511][ T7460] ? __might_sleep+0x66/0xc0
[198256.053442][ T7460] ? __kasan_check_read+0x11/0x20
[198256.054251][ T7460] ? iget5_locked+0xbd/0x150
[198256.054986][ T7460] ? run_delayed_iput_locked+0x110/0x110
[198256.055929][ T7460] ? btrfs_iget+0xc7/0x150
[198256.056630][ T7460] ? btrfs_orphan_cleanup+0x4a0/0x4a0
[198256.057502][ T7460] ? free_extent_buffer+0x13/0x20
[198256.058322][ T7460] btrfs_log_inode+0x2654/0x26d0
[198256.059137][ T7460] ? log_directory_changes+0x170/0x170
[198256.060020][ T7460] ? rcu_read_lock_sched_held+0x16/0x80
[198256.060930][ T7460] ? rcu_read_lock_sched_held+0x16/0x80
[198256.061905][ T7460] ? lock_contended+0x770/0x770
[198256.062682][ T7460] ? btrfs_log_inode_parent+0xd04/0x1750
[198256.063582][ T7460] ? lock_downgrade+0x420/0x420
[198256.064432][ T7460] ? preempt_count_sub+0x18/0xc0
[198256.065550][ T7460] ? __mutex_lock+0x580/0xdc0
[198256.066654][ T7460] ? stack_trace_save+0x94/0xc0
[198256.068008][ T7460] ? __kasan_check_write+0x14/0x20
[198256.072149][ T7460] ? __mutex_unlock_slowpath+0x12a/0x430
[198256.073145][ T7460] ? mutex_lock_io_nested+0xcd0/0xcd0
[198256.074341][ T7460] ? wait_for_completion_io_timeout+0x20/0x20
[198256.075345][ T7460] ? lock_downgrade+0x420/0x420
[198256.076142][ T7460] ? lock_contended+0x770/0x770
[198256.076939][ T7460] ? do_raw_spin_lock+0x1c0/0x1c0
[198256.078401][ T7460] ? btrfs_sync_file+0x5e6/0xa40
[198256.080598][ T7460] btrfs_log_inode_parent+0x523/0x1750
[198256.081991][ T7460] ? wait_current_trans+0xc8/0x240
[198256.083320][ T7460] ? lock_downgrade+0x420/0x420
[198256.085450][ T7460] ? btrfs_end_log_trans+0x70/0x70
[198256.086362][ T7460] ? rcu_read_lock_sched_held+0x16/0x80
[198256.087544][ T7460] ? lock_release+0xcf/0x8a0
[198256.088305][ T7460] ? lock_downgrade+0x420/0x420
[198256.090375][ T7460] ? dget_parent+0x8e/0x300
[198256.093538][ T7460] ? do_raw_spin_lock+0x1c0/0x1c0
[198256.094918][ T7460] ? lock_downgrade+0x420/0x420
[198256.097815][ T7460] ? do_raw_spin_unlock+0xa9/0x100
[198256.101822][ T7460] ? dget_parent+0xb7/0x300
[198256.103345][ T7460] btrfs_log_dentry_safe+0x48/0x60
[198256.105052][ T7460] btrfs_sync_file+0x629/0xa40
[198256.106829][ T7460] ? start_ordered_ops.constprop.0+0x120/0x120
[198256.109655][ T7460] ? __fget_files+0x161/0x230
[198256.110760][ T7460] vfs_fsync_range+0x6d/0x110
[198256.111923][ T7460] ? start_ordered_ops.constprop.0+0x120/0x120
[198256.113556][ T7460] __x64_sys_fsync+0x45/0x70
[198256.114323][ T7460] do_syscall_64+0x5c/0xc0
[198256.115084][ T7460] ? syscall_exit_to_user_mode+0x3b/0x50
[198256.116030][ T7460] ? do_syscall_64+0x69/0xc0
[198256.116768][ T7460] ? do_syscall_64+0x69/0xc0
[198256.117555][ T7460] ? do_syscall_64+0x69/0xc0
[198256.118324][ T7460] ? sysvec_call_function_single+0x57/0xc0
[198256.119308][ T7460] ? asm_sysvec_call_function_single+0xa/0x20
[198256.120363][ T7460] entry_SYSCALL_64_after_hwframe+0x44/0xae
[198256.121334][ T7460] RIP: 0033:0x7fc7fe97b6ab
[198256.122067][ T7460] Code: 0f 05 48 (...)
[198256.125198][ T7460] RSP: 002b:00007fc7f7c23950 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[198256.126568][ T7460] RAX: ffffffffffffffda RBX: 00007fc7f7c239f0 RCX: 00007fc7fe97b6ab
[198256.127942][ T7460] RDX: 0000000000000002 RSI: 000056167536bcf0 RDI: 0000000000000004
[198256.129302][ T7460] RBP: 0000000000000004 R08: 0000000000000000 R09: 000000007ffffeb8
[198256.130670][ T7460] R10: 00000000000001ff R11: 0000000000000293 R12: 0000000000000001
[198256.132046][ T7460] R13: 0000561674ca8140 R14: 00007fc7f7c239d0 R15: 000056167536dab8
[198256.133403][ T7460] </TASK>
Fix this by treating -EEXIST as expected at insert_dir_log_key() and have
it update the item with an end offset corresponding to the maximum between
the previously logged end offset and the new requested end offset. The end
offsets may be different due to dir index key deletions that happened as
part of unlink operations while we are logging a directory (triggered when
fsyncing some other inode parented by the directory) or during renames
which always attempt to log a single dir index deletion.
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/YmyefE9mc2xl5ZMz@hungrycats.org/
Fixes: 732d591a5d ("btrfs: stop copying old dir items when logging a directory")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_zone_activate() checks if it activated all the underlying zones in
the loop. However, that check never hit on an unlimited activate zone
device (max_active_zones == 0).
Fortunately, it still works without ENOSPC because btrfs_zone_activate()
returns true in the end, even if block_group->zone_is_active == 0. But, it
is confusing to have non zone_is_active block group still usable for
allocation. Also, we are wasting CPU time to iterate the loop every time
btrfs_zone_activate() is called for the blog groups.
Since error case in the loop is handled by out_unlock, we can just set
zone_is_active and do the list stuff after the loop.
Fixes: f9a912a3c4 ("btrfs: zoned: make zone activation multi stripe capable")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_zone_activate() checks if block_group->alloc_offset ==
block_group->zone_capacity every time it iterates the loop. But, it is
not depending on the index. Move out the check and do it only once.
Fixes: f9a912a3c4 ("btrfs: zoned: make zone activation multi stripe capable")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
For a 4K sector sized btrfs with v1 cache enabled and only mounted on
systems with 4K page size, if it's mounted on subpage (64K page size)
systems, it can cause the following warning on v1 space cache:
BTRFS error (device dm-1): csum mismatch on free space cache
BTRFS warning (device dm-1): failed to load free space cache for block group 84082688, rebuilding it now
Although not a big deal, as kernel can rebuild it without problem, such
warning will bother end users, especially if they want to switch the
same btrfs seamlessly between different page sized systems.
[CAUSE]
V1 free space cache is still using fixed PAGE_SIZE for various bitmap,
like BITS_PER_BITMAP.
Such hard-coded PAGE_SIZE usage will cause various mismatch, from v1
cache size to checksum.
Thus kernel will always reject v1 cache with a different PAGE_SIZE with
csum mismatch.
[FIX]
Although we should fix v1 cache, it's already going to be marked
deprecated soon.
And we have v2 cache based on metadata (which is already fully subpage
compatible), and it has almost everything superior than v1 cache.
So just force subpage mount to use v2 cache on mount.
Reported-by: Matt Corallo <blnxfsl@bluematt.me>
CC: stable@vger.kernel.org # 5.15+
Link: https://lore.kernel.org/linux-btrfs/61aa27d1-30fc-c1a9-f0f4-9df544395ec3@bluematt.me/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
- Disable -Warray-bounds warning for gcc12, since there known way to
workaround false positive warnings on lowcore accesses would result
in worse code on fast paths.
- Avoid lockdep_assert_held() warning in kvm vm memop code.
- Reduce overhead within gmap_rmap code to get rid of long latencies
when e.g. shutting down 2nd level guests.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmJzvigACgkQIg7DeRsp
bsLM7A/7BFCekdwNQSQMHPzEo4auzowIRDVbj+EE5MyoSoi+t5tQ67fgpZPYLqAZ
W3PFxrX0jxIFBpy2NuCzwPjEwDnApqKlmwiwHJ1euXxRDSoOsX4sl/l9sore5eaK
NCo5VC9d1y0sSCS6hQ+SWLU9jDImFdBSdPMpNQw2SWwN1MCWHKoE996XJ5VwkDau
5s21nM3uO43yZdLaGlfTAdoBvIDHC0UX5NAto0W988s/eReBnoKudL6ZRflbbW5H
/dao1oyN90adTnSrj1BMl182Cx8OyQzeMtud0vud7hYzmxO/SxWc5doLKv/cuXkx
fsPnJ7smnQw2By5xeEt4xj10amLU6c+tXI+PS0YdxBP7odz3dcXBI1Bt6yDO0NH/
LotbA9v/D4VhTXOysK9fnlKI/7cKDNt/kE5kBTyGtSb4AfL2LRnYwRvoCnfjJ57j
gBNa48bTLfX5Nz6BFLDzOAsLPoGaKT6Eun7l3iaK864pGBCimvpdM1gNghzfIJSY
2C6cJxqoCDXYWFt4TWaZaGPs1J2DI6AtucIA/FlMmV7YqYyOIJxUh/j3fh8ln8+/
eCg1CQwj3IIsnkA6lQVc7Ne01ita9m8kTd1Ep6o5xqQXg46FcGOuJjBYkKbCzVxX
kG9pjg4ATU8kgGH3hvRa9Wy3s/w+AyKfrbht/M/GFk289ski0yM=
=KjLw
-----END PGP SIGNATURE-----
Merge tag 's390-5.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Heiko Carstens:
- Disable -Warray-bounds warning for gcc12, since the only known way to
workaround false positive warnings on lowcore accesses would result
in worse code on fast paths.
- Avoid lockdep_assert_held() warning in kvm vm memop code.
- Reduce overhead within gmap_rmap code to get rid of long latencies
when e.g. shutting down 2nd level guests.
* tag 's390-5.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
KVM: s390: vsie/gmap: reduce gmap_rmap overhead
KVM: s390: Fix lockdep issue in vm memop
s390: disable -Warray-bounds
wireguard
Previous releases - regressions:
- igmp: respect RCU rules in ip_mc_source() and ip_mc_msfilter()
- mld: respect RCU rules in ip6_mc_source() and ip6_mc_msfilter()
- rds: acquire netns refcount on TCP sockets
- rxrpc: enable IPv6 checksums on transport socket
- nic: hinic: fix bug of wq out of bound access
- nic: thunder: don't use pci_irq_vector() in atomic context
- nic: bnxt_en: fix possible bnxt_open() failure caused by wrong RFS flag
- nic: mlx5e:
- lag, fix use-after-free in fib event handler
- fix deadlock in sync reset flow
Previous releases - always broken:
- tcp: fix insufficient TCP source port randomness
- can: grcan: grcan_close(): fix deadlock
- nfc: reorder destructive operations in to avoid bugs
Misc:
- wireguard: improve selftests reliability
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmJznX8SHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOkDrgP/R9tErvWO/uvXpNgDr6Qh8osYt5Z297l
EWyhz7cUm4LKi6MYWrRKR4uRK9n43DK+OVws5LXrYL0tIdJH3uYBE0RS67W9WmjA
kE2Srq1A6wUi4koiYKeYDXtodCJLC93n+QnLBfih44Pc+xmk8t+G6qZ1n45qjRss
gzV75AlIfErmjqyYi81DaZ6Z0TV4H5qPM4ZXRViIzH+Ccyx6rk/KNqU4wepoqRSi
lCckTvMt9V7OiYHzM5Pu1kTUV07Jtiy7xkIQMdKYXCZpyqkmqyPFMM+0B7fDOEeP
WZnkdUwi69WMVmeefcpEn7XsoNbVadGkTQM2EcUWvrxuCeawmGxYoORvvFs0IpAX
YkYXk1US0Sd1L2XlMaus+HLsmmx4fWnb/hWqGL/D+arZOvTCOhBQItSRmKA6d+kM
OLfj/gh0YLBsHVrCiHUN06oopvhWuBEBAJbVFkbJCvXoFGqHigijBCVjFBVH1p4o
L5bWVEAQ8tkFdofXw0nOe6vRCD5BGN34N5DkqC5E8mj/uLP0FVEWOISV3TzKKF5B
mEDGZAGN5bTf/ScvbF8XEaqtdk/cxv2ohWNn9wtgoaNBorgKtpTf99pXJtxV2+fs
3RiPM0My9uz8/wMveSfKShQntMSdnmQPMpJ4Vm0e4bOS1K0LRGUgZxOpX2/BTokq
Iv5msx85X5/S
=XuN7
-----END PGP SIGNATURE-----
Merge tag 'net-5.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from can, rxrpc and wireguard.
Previous releases - regressions:
- igmp: respect RCU rules in ip_mc_source() and ip_mc_msfilter()
- mld: respect RCU rules in ip6_mc_source() and ip6_mc_msfilter()
- rds: acquire netns refcount on TCP sockets
- rxrpc: enable IPv6 checksums on transport socket
- nic: hinic: fix bug of wq out of bound access
- nic: thunder: don't use pci_irq_vector() in atomic context
- nic: bnxt_en: fix possible bnxt_open() failure caused by wrong RFS
flag
- nic: mlx5e:
- lag, fix use-after-free in fib event handler
- fix deadlock in sync reset flow
Previous releases - always broken:
- tcp: fix insufficient TCP source port randomness
- can: grcan: grcan_close(): fix deadlock
- nfc: reorder destructive operations in to avoid bugs
Misc:
- wireguard: improve selftests reliability"
* tag 'net-5.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (63 commits)
NFC: netlink: fix sleep in atomic bug when firmware download timeout
selftests: ocelot: tc_flower_chains: specify conform-exceed action for policer
tcp: drop the hash_32() part from the index calculation
tcp: increase source port perturb table to 2^16
tcp: dynamically allocate the perturb table used by source ports
tcp: add small random increments to the source port
tcp: resalt the secret every 10 seconds
tcp: use different parts of the port_offset for index and offset
secure_seq: use the 64 bits of the siphash for port offset calculation
wireguard: selftests: set panic_on_warn=1 from cmdline
wireguard: selftests: bump package deps
wireguard: selftests: restore support for ccache
wireguard: selftests: use newer toolchains to fill out architectures
wireguard: selftests: limit parallelism to $(nproc) tests at once
wireguard: selftests: make routing loop test non-fatal
net/mlx5: Fix matching on inner TTC
net/mlx5: Avoid double clear or set of sync reset requested
net/mlx5: Fix deadlock in sync reset flow
net/mlx5e: Fix trust state reset in reload
net/mlx5e: Avoid checking offload capability in post_parse action
...
The fwnode of GPIO IRQ must be set to its own fwnode, not the fwnode of the
parent IRQ. Therefore, this sets own fwnode instead of the parent IRQ fwnode to
GPIO IRQ's.
Fixes: 2ad74f40da ("gpio: visconti: Add Toshiba Visconti GPIO support")
Signed-off-by: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
My git tree has become the de facto main GPIO tree. Update the
MAINTAINERS file to reflect that.
Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
Reported-by: Baruch Siach <baruch@tkos.co.il>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
There are sleep in atomic bug that could cause kernel panic during
firmware download process. The root cause is that nlmsg_new with
GFP_KERNEL parameter is called in fw_dnld_timeout which is a timer
handler. The call trace is shown below:
BUG: sleeping function called from invalid context at include/linux/sched/mm.h:265
Call Trace:
kmem_cache_alloc_node
__alloc_skb
nfc_genl_fw_download_done
call_timer_fn
__run_timers.part.0
run_timer_softirq
__do_softirq
...
The nlmsg_new with GFP_KERNEL parameter may sleep during memory
allocation process, and the timer handler is run as the result of
a "software interrupt" that should not call any other function
that could sleep.
This patch changes allocation mode of netlink message from GFP_KERNEL
to GFP_ATOMIC in order to prevent sleep in atomic bug. The GFP_ATOMIC
flag makes memory allocation operation could be used in atomic context.
Fixes: 9674da8759 ("NFC: Add firmware upload netlink command")
Fixes: 9ea7187c53 ("NFC: netlink: Rename CMD_FW_UPLOAD to CMD_FW_DOWNLOAD")
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20220504055847.38026-1-duoming@zju.edu.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reading 100KB chunks from a big file (eg dd bs=100K) leads to poor
readahead behaviour. Studying the traces in detail, I noticed two
problems.
The first is that we were setting the readahead flag on the folio which
contains the last byte read from the block. This is wrong because we
will trigger readahead at the end of the read without waiting to see
if a subsequent read is going to use the pages we just read. Instead,
we need to set the readahead flag on the first folio _after_ the one
which contains the last byte that we're reading.
The second is that we were looking for the index of the folio with the
readahead flag set to exactly match the start + size - async_size.
If we've rounded this, either down (as previously) or up (as now),
we'll think we hit a folio marked as readahead by a different read,
and try to read the wrong pages. So round the expected index to the
order of the folio we hit.
Reported-by: Guo Xuenan <guoxuenan@huawei.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
It is unsafe to call folio_next() on a folio unless you hold a reference
on it that prevents it from being split or freed. After returning
from the iterator, iomap calls folio_end_writeback() which may drop
the last reference to the page, or allow the page to be split. If that
happens, the iterator will not advance far enough through the bio_vec,
leading to assertion failures like the BUG() in folio_end_writeback()
that checks we're not trying to end writeback on a page not currently
under writeback. Other assertion failures were also seen, but they're
all explained by this one bug.
Fix the bug by remembering where the next folio starts before returning
from the iterator. There are other ways of fixing this bug, but this
seems the simplest.
Reported-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Reported-by: Brian Foster <bfoster@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
As discussed here with Ido Schimmel:
https://patchwork.kernel.org/project/netdevbpf/patch/20220224102908.5255-2-jianbol@nvidia.com/
the default conform-exceed action is "reclassify", for a reason we don't
really understand.
The point is that hardware can't offload that police action, so not
specifying "conform-exceed" was always wrong, even though the command
used to work in hardware (but not in software) until the kernel started
adding validation for it.
Fix the command used by the selftest by making the policer drop on
exceed, and pass the packet to the next action (goto) on conform.
Fixes: 8cd6b020b6 ("selftests: ocelot: add some example VCAP IS1, IS2 and ES0 tc offloads")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20220503121428.842906-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Willy Tarreau says:
====================
insufficient TCP source port randomness
In a not-yet published paper, Moshe Kol, Amit Klein, and Yossi Gilad
report being able to accurately identify a client by forcing it to emit
only 40 times more connections than the number of entries in the
table_perturb[] table, which is indexed by hashing the connection tuple.
The current 2^8 setting allows them to perform that attack with only 10k
connections, which is not hard to achieve in a few seconds.
Eric, Amit and I have been working on this for a few weeks now imagining,
testing and eliminating a number of approaches that Amit and his team were
still able to break or that were found to be too risky or too expensive,
and ended up with the simple improvements in this series that resists to
the attack, doesn't degrade the performance, and preserves a reliable port
selection algorithm to avoid connection failures, including the odd/even
port selection preference that allows bind() to always find a port quickly
even under strong connect() stress.
The approach relies on several factors:
- resalting the hash secret that's used to choose the table_perturb[]
entry every 10 seconds to eliminate slow attacks and force the
attacker to forget everything that was learned after this delay.
This already eliminates most of the problem because if a client
stays silent for more than 10 seconds there's no link between the
previous and the next patterns, and 10s isn't yet frequent enough
to cause too frequent repetition of a same port that may induce a
connection failure ;
- adding small random increments to the source port. Previously, a
random 0 or 1 was added every 16 ports. Now a random 0 to 7 is
added after each port. This means that with the default 32768-60999
range, a worst case rollover happens after 1764 connections, and
an average of 3137. This doesn't stop statistical attacks but
requires significantly more iterations of the same attack to
confirm a guess.
- increasing the table_perturb[] size from 2^8 to 2^16, which Amit
says will require 2.6 million connections to be attacked with the
changes above, making it pointless to get a fingerprint that will
only last 10 seconds. Due to the size, the table was made dynamic.
- a few minor improvements on the bits used from the hash, to eliminate
some unfortunate correlations that may possibly have been exploited
to design future attack models.
These changes were tested under the most extreme conditions, up to
1.1 million connections per second to one and a few targets, showing no
performance regression, and only 2 connection failures within 13 billion,
which is less than 2^-32 and perfectly within usual values.
The series is split into small reviewable changes and was already reviewed
by Amit and Eric.
====================
Link: https://lore.kernel.org/r/20220502084614.24123-1-w@1wt.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In commit 190cc82489 ("tcp: change source port randomizarion at
connect() time"), the table_perturb[] array was introduced and an
index was taken from the port_offset via hash_32(). But it turns
out that hash_32() performs a multiplication while the input here
comes from the output of SipHash in secure_seq, that is well
distributed enough to avoid the need for yet another hash.
Suggested-by: Amit Klein <aksecurity@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Moshe Kol, Amit Klein, and Yossi Gilad reported being able to accurately
identify a client by forcing it to emit only 40 times more connections
than there are entries in the table_perturb[] table. The previous two
improvements consisting in resalting the secret every 10s and adding
randomness to each port selection only slightly improved the situation,
and the current value of 2^8 was too small as it's not very difficult
to make a client emit 10k connections in less than 10 seconds.
Thus we're increasing the perturb table from 2^8 to 2^16 so that the
same precision now requires 2.6M connections, which is more difficult in
this time frame and harder to hide as a background activity. The impact
is that the table now uses 256 kB instead of 1 kB, which could mostly
affect devices making frequent outgoing connections. However such
components usually target a small set of destinations (load balancers,
database clients, perf assessment tools), and in practice only a few
entries will be visited, like before.
A live test at 1 million connections per second showed no performance
difference from the previous value.
Reported-by: Moshe Kol <moshe.kol@mail.huji.ac.il>
Reported-by: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
Reported-by: Amit Klein <aksecurity@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We'll need to further increase the size of this table and it's likely
that at some point its size will not be suitable anymore for a static
table. Let's allocate it on boot from inet_hashinfo2_init(), which is
called from tcp_init().
Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
Cc: Amit Klein <aksecurity@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Here we're randomly adding between 0 and 7 random increments to the
selected source port in order to add some noise in the source port
selection that will make the next port less predictable.
With the default port range of 32768-60999 this means a worst case
reuse scenario of 14116/8=1764 connections between two consecutive
uses of the same port, with an average of 14116/4.5=3137. This code
was stressed at more than 800000 connections per second to a fixed
target with all connections closed by the client using RSTs (worst
condition) and only 2 connections failed among 13 billion, despite
the hash being reseeded every 10 seconds, indicating a perfectly
safe situation.
Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
Cc: Amit Klein <aksecurity@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In order to limit the ability for an observer to recognize the source
ports sequence used to contact a set of destinations, we should
periodically shuffle the secret. 10 seconds looks effective enough
without causing particular issues.
Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
Cc: Amit Klein <aksecurity@gmail.com>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Tested-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Amit Klein suggests that we use different parts of port_offset for the
table's index and the port offset so that there is no direct relation
between them.
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
Cc: Amit Klein <aksecurity@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
SipHash replaced MD5 in secure_ipv{4,6}_port_ephemeral() via commit
7cd23e5300 ("secure_seq: use SipHash in place of MD5"), but the output
remained truncated to 32-bit only. In order to exploit more bits from the
hash, let's make the functions return the full 64-bit of siphash_3u32().
We also make sure the port offset calculation in __inet_hash_connect()
remains done on 32-bit to avoid the need for div_u64_rem() and an extra
cost on 32-bit systems.
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Moshe Kol <moshe.kol@mail.huji.ac.il>
Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
Cc: Amit Klein <aksecurity@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jason A. Donenfeld says:
====================
wireguard patches for 5.18-rc6
In working on some other problems, I wound up leaning on the WireGuard
CI more than usual and uncovered a few small issues with reliability.
These are fairly low key changes, since they don't impact kernel code
itself.
One change does stick out in particular, though, which is the "make
routing loop test non-fatal" commit. I'm not thrilled about doing this,
but currently [1] remains unsolved, and I'm still working on a real
solution to that (hopefully for 5.19 or 5.20 if I can come up with a
good idea...), so for now that test just prints a big red warning
instead.
[1] https://lore.kernel.org/netdev/YmszSXueTxYOC41G@zx2c4.com/
====================
Link: https://lore.kernel.org/r/20220504202920.72908-1-Jason@zx2c4.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rather than setting this once init is running, set panic_on_warn from
the kernel command line, so that it catches splats from WireGuard
initialization code and the various crypto selftests.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use newer, more reliable package dependencies. These should hopefully
reduce flakes. However, we keep the old iputils package, as it
accumulated bugs after resulting in flakes on slow machines.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When moving to non-system toolchains, we inadvertantly killed the
ability to use ccache. So instead, build ccache support into the test
harness directly.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rather than relying on the system to have cross toolchains available,
simply download musl.cc's ones and use that libc.so, and then we use it
to fill in a few missing platforms, such as riscv64, riscv64, powerpc64,
and s390x.
Since riscv doesn't have a second serial port in its device description,
we have to use virtio's vport. This is actually the same situation on
ARM, but we were previously hacking QEMU up to work around this, which
required a custom QEMU. Instead just do the vport trick on ARM too.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The parallel tests were added to catch queueing issues from multiple
cores. But what happens in reality when testing tons of processes is
that these separate threads wind up fighting with the scheduler, and we
wind up with contention in places we don't care about that decrease the
chances of hitting a bug. So just do a test with the number of CPU
cores, rather than trying to scale up arbitrarily.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rxe_mcast.c currently uses _irqsave spinlocks for rxe->mcg_lock while
rxe_recv.c uses _bh spinlocks for the same lock.
As there is no case where the mcg_lock can be taken from an IRQ, change
these all to bh locks so we don't have confusing mismatched lock types on
the same spinlock.
Fixes: 6090a0c4c7 ("RDMA/rxe: Cleanup rxe_mcast.c")
Link: https://lore.kernel.org/r/20220504202817.98247-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
These routines were not intended to be called under a spinlock and will
throw debugging warnings:
raw_local_irq_restore() called with IRQs enabled
WARNING: CPU: 13 PID: 3107 at kernel/locking/irqflag-debug.c:10 warn_bogus_irq_restore+0x2f/0x50
CPU: 13 PID: 3107 Comm: python3 Tainted: G E 5.18.0-rc1+ #7
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
RIP: 0010:warn_bogus_irq_restore+0x2f/0x50
Call Trace:
<TASK>
_raw_spin_unlock_irqrestore+0x75/0x80
rxe_attach_mcast+0x304/0x480 [rdma_rxe]
ib_attach_mcast+0x88/0xa0 [ib_core]
ib_uverbs_attach_mcast+0x186/0x1e0 [ib_uverbs]
ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xcd/0x140 [ib_uverbs]
ib_uverbs_cmd_verbs+0xdb0/0xea0 [ib_uverbs]
ib_uverbs_ioctl+0xd2/0x160 [ib_uverbs]
do_syscall_64+0x5c/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Move them out of the spinlock, it is OK if there is some races setting up
the MC reception at the ethernet layer with rbtree lookups.
Fixes: 6090a0c4c7 ("RDMA/rxe: Cleanup rxe_mcast.c")
Link: https://lore.kernel.org/r/20220504202817.98247-1-rpearsonhpe@gmail.com
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The calling of siw_cm_upcall and detaching new_cep with its listen_cep
should be atomistic semantics. Otherwise siw_reject may be called in a
temporary state, e,g, siw_cm_upcall is called but the new_cep->listen_cep
has not being cleared.
This fixes a WARN:
WARNING: CPU: 7 PID: 201 at drivers/infiniband/sw/siw/siw_cm.c:255 siw_cep_put+0x125/0x130 [siw]
CPU: 2 PID: 201 Comm: kworker/u16:22 Kdump: loaded Tainted: G E 5.17.0-rc7 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Workqueue: iw_cm_wq cm_work_handler [iw_cm]
RIP: 0010:siw_cep_put+0x125/0x130 [siw]
Call Trace:
<TASK>
siw_reject+0xac/0x180 [siw]
iw_cm_reject+0x68/0xc0 [iw_cm]
cm_work_handler+0x59d/0xe20 [iw_cm]
process_one_work+0x1e2/0x3b0
worker_thread+0x50/0x3a0
? rescuer_thread+0x390/0x390
kthread+0xe5/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>
Fixes: 6c52fdc244 ("rdma/siw: connection management")
Link: https://lore.kernel.org/r/d528d83466c44687f3872eadcb8c184528b2e2d4.1650526554.git.chengyou@linux.alibaba.com
Reported-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
We no longer use these since 111659c2a5 (and they never worked
anyway); drop them from the example to avoid confusion.
Fixes: 111659c2a5 ("arm64: dts: apple: t8103: Remove PCIe max-link-speed properties")
Signed-off-by: Hector Martin <marcan@marcan.st>
Reviewed-by: Alyssa Rosenzweig <alyssa@rosenzweig.io>
Signed-off-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20220502091308.28233-1-marcan@marcan.st
Another round of removing redundant minItems/maxItems when 'items' list is
specified. This time it is in if/then schemas as the meta-schema was
failing to check this case.
If a property has an 'items' list, then a 'minItems' or 'maxItems' with the
same size as the list is redundant and can be dropped. Note that is DT
schema specific behavior and not standard json-schema behavior. The tooling
will fixup the final schema adding any unspecified minItems/maxItems.
Signed-off-by: Rob Herring <robh@kernel.org>
Acked-By: Vinod Koul <vkoul@kernel.org>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Acked-by: Mark Brown <broonie@kernel.org>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # For MMC
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> #for IIO
Link: https://lore.kernel.org/r/20220503162738.3827041-1-robh@kernel.org
A few platforms, at91 and tegra, use drive-push-pull and
drive-open-drain with a 0 or 1 value. There's not really a need for values
as '1' should be equivalent to no value (it wasn't treated that way) and
drive-push-pull disabled is equivalent to drive-open-drain. So dropping the
value can't be done without breaking existing OSs. As we don't want new
cases, mark the case with values as deprecated.
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Thierry Reding <thierry.reding@gmail.com>
Cc: Jonathan Hunter <jonathanh@nvidia.com>
Cc: Nicolas Ferre <nicolas.ferre@microchip.com>
Cc: Claudiu Beznea <claudiu.beznea@microchip.com>
Signed-off-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20220429194610.2741437-1-robh@kernel.org
Including:
- Fix for a regression in IOMMU core code which could cause NULL-ptr
dereferences
- Arm SMMU fixes for 5.18
- Fix off-by-one in SMMUv3 SVA TLB invalidation
- Disable large mappings to workaround nvidia erratum
- Intel VT-d fixes
- Handle PCI stop marker messages in IOMMU driver to meet the
requirement of I/O page fault handling framework.
- Calculate a feasible mask for non-aligned page-selective IOTLB
invalidation.
- Two fixes for Apple DART IOMMU driver
- Fix potential NULL-ptr dereference
- Set module owner
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmJyoYIACgkQK/BELZcB
GuMLOhAAu4ZTxHmBbS/JDt0+DUKODB2z3vRDBoZxI+c6QCRvu5PhrCnR1nTmoNGg
Falu/QlzcVmTupUOT0OLrTd4H8KKzlXJuFcVJAQh98ReSQ5zIkBX+b7ruL3+KKAU
Y/zDy/nD5G9G1vM9/PjZsSvr2U9Vq1p3qNArEZDnd8UYaVokkC0nNsxQhAoVA43U
WjLR9Y7wCnxLFweaa3ujzSzmAOgVcLeMBV/siaRcpu4ohwHS6kKOikVAJWlgp/pc
bCHpfxJbBLSG7lIAagWHbC4EF8GxTl/+u1+8pw1fVXpBG4Qaj60VSiITznLqSBi4
avTTlwwPsLQOMshmARllIDWJny7TCEJZhqxSO22ki9Em32+zMA7e2ozKp2cSwyn2
qXFnF4o3lSpDE99qnQPHWqWXQVVabkuLgwyqqDHMg0BSKx3MPDhagXdmYh/eI9hF
j0BeY1hkMJU+kt+bNTSKSwwsooanG8hZvsTepct9+oeIZndEbyG0oKi7nW+xK2zD
bKDyrY8cr2MBIBcV+yiDPQoCmbegDBouxCZtkLDo4cug30IVqDWvu5efld1WkhBX
BLJC5LI+EibhzEKBiHk6qb3Gz4nIvvbAs+oGVdPsjI2LMrICq4eJ4BFaZT8I8PfY
QGg9iKW3SCtfwYlfm6/a2Y23k72T86zZFlgMW7fEN4pyR5SEkcc=
=jB96
-----END PGP SIGNATURE-----
Merge tag 'iomm-fixes-v5.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:
"IOMMU core:
- Fix for a regression which could cause NULL-ptr dereferences
Arm SMMU:
- Fix off-by-one in SMMUv3 SVA TLB invalidation
- Disable large mappings to workaround nvidia erratum
Intel VT-d:
- Handle PCI stop marker messages in IOMMU driver to meet the
requirement of I/O page fault handling framework.
- Calculate a feasible mask for non-aligned page-selective IOTLB
invalidation.
Apple DART IOMMU:
- Fix potential NULL-ptr dereference
- Set module owner"
* tag 'iomm-fixes-v5.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu: Make sysfs robust for non-API groups
iommu/dart: Add missing module owner to ops structure
iommu/dart: check return value after calling platform_get_resource()
iommu/vt-d: Drop stop marker messages
iommu/vt-d: Calculate mask for non-aligned flushes
iommu: arm-smmu: disable large page mappings for Nvidia arm-smmu
iommu/arm-smmu-v3: Fix size calculation in arm_smmu_mm_invalidate_range()
This has been in for-next for a bit (longer than the times would
indicate, I had to rebase to add some text to the headers) and these are
fixes that need to go in.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE/Q1c5nzg9ZpmiCaGYfOMkJGb/4EFAmJyaB4ACgkQYfOMkJGb
/4HOwBAAqdLDIo7WSYjWqmK4hUzLS5iBvgLiH41QKt/AVGkwX20YYANmlQFRShT/
Ue08cjbCcYKVyQCLAnV7nG4RU26U+eAlLL/owH8CVLGXEND9IGahzii1Up350fhX
EjZgoHoEiU+xTGiRs/Tsag0v5WL++78qiXi7tQX2NsWwBT+vCDMEo/+Xv9h+TKzg
okVi78/4LATI5jZJpCkz7uRjrEIPoUs5rTkfe5p4EhBfis+pd0CIs0gIsrsxB12p
cnD55NOJoxAS0EQnRiLakjy320GrvlHUy0VfFByEkMWKWhl4c6aE60mWI85CcGMo
qrovIeAiGZ8r3gqwP986WJBNbSve3f9b2gNeLlai+k4loeyANUOoTx7rgGZzzqNo
KKJd9v/wjDXW/cGF41sTYoFHpJXF/CZfm6rusYcQrC/L052sEwG3clasWsmVsmml
suRbEYzDoDPfyOqEcLztA/OJr6bkA+/eu458W4Q75WWNfF86mM1KapZjoKNXokq+
baE855AP4bdyoG+1lKpORk+3oOlqMWYyP5Bkmkz/JIyQAmzB8DhJZeD4SMHkMgp2
Bb/rjm/AZRxjEX/bvQyO31mpSaqM39Yxua4qRZcmxAW+QXTYBgtE9psNxZ7BGpRJ
dfl/siFeVMUBUnFTlB5pf49Mt9QZomcQ2wBZ029gHncWvOjCSXs=
=f3CV
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.17-2' of https://github.com/cminyard/linux-ipmi
Pull IPMI fixes from Corey Minyard:
"Fix some issues that were reported.
This has been in for-next for a bit (longer than the times would
indicate, I had to rebase to add some text to the headers) and these
are fixes that need to go in"
* tag 'for-linus-5.17-2' of https://github.com/cminyard/linux-ipmi:
ipmi:ipmi_ipmb: Fix null-ptr-deref in ipmi_unregister_smi()
ipmi: When handling send message responses, don't process the message
A faulty receiver might report an erroneous channel count. We
should guard against reading beyond AUDIO_CHANNELS_COUNT as
that would overflow the dpcd_pattern_period array.
Signed-off-by: Harry Wentland <harry.wentland@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
While technically Xen dom0 is a virtual machine too, it does have
access to most of the hardware so it doesn't need to be considered a
"passthrough". Commit b818a5d374 ("drm/amdgpu/gmc: use PCI BARs for
APUs in passthrough") changed how FB is accessed based on passthrough
mode. This breaks amdgpu in Xen dom0 with message like this:
[drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
While the reason for this failure is unclear, the passthrough mode is
not really necessary in Xen dom0 anyway. So, to unbreak booting affected
kernels, disable passthrough mode in this case.
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1985
Fixes: b818a5d374 ("drm/amdgpu/gmc: use PCI BARs for APUs in passthrough")
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Groups created by VFIO backends outside the core IOMMU API should never
be passed directly into the API itself, however they still expose their
standard sysfs attributes, so we can still stumble across them that way.
Take care to consider those cases before jumping into our normal
assumptions of a fully-initialised core API group.
Fixes: 3f6634d997 ("iommu: Use right way to retrieve iommu_ops")
Reported-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/86ada41986988511a8424e84746dfe9ba7f87573.1651667683.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmJyJHcACgkQSD+KveBX
+j60dQf/UFKW7VzUfUyT4+pgMZHdCQLsEkf7lgHzFjcWD3XaiBr7him3RIFl0ZrI
0mD2AHA7YT6ePtub8Pjyqbfu4hC7rYWMDW6J2iqV7V05co/AVdWLFgxex1zEMhK8
nD2f4nFkgs5rOeSn6tOj0ttSlXQOPmPPh5FfIGy2oiGTeNWOYG5URw1crOoNe+HG
0Z4uAmQDXETFLgim+A8DDTrbjSY60EDT/qnpOhPISpE/EtiG8OCaPDtcsSEmF9re
0LebIKUvx4X9fsHPvxyC0TfKvYpxSHkbz+VAx//O8+aGmHKgco07/3aeWM5Z9D/7
ulbQJQTIwcsM3GwlAFygCjFWFqiLvQ==
=gXIJ
-----END PGP SIGNATURE-----
Merge tag 'mlx5-fixes-2022-05-03' of git://git.kernel.org/pub/scm/linux/kernel/g
it/saeed/linux
Saeed Mahameed says:
====================
mlx5 fixes 2022-05-03
This series provides bug fixes to mlx5 driver.
Please pull and let me know if there is any problem.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>