[ Upstream commit f9ca51596bbfd0f9c386dd1c613c394c78d9e5e6 ]
Fix CVE: CVE-2024-42304
The syzbot constructs a directory that has no dirblock but is non-inline,
i.e. the first directory block is a hole. And no errors are reported when
creating files in this directory in the following flow.
ext4_mknod
...
ext4_add_entry
// Read block 0
ext4_read_dirblock(dir, block, DIRENT)
bh = ext4_bread(NULL, inode, block, 0)
if (!bh && (type == INDEX || type == DIRENT_HTREE))
// The first directory block is a hole
// But type == DIRENT, so no error is reported.
After that, we get a directory block without '.' and '..' but with a valid
dentry. This may cause some code that relies on dot or dotdot (such as
make_indexed_dir()) to crash.
Therefore when ext4_read_dirblock() finds that the first directory block
is a hole report that the filesystem is corrupted and return an error to
avoid loading corrupted data from disk causing something bad.
Reported-by: syzbot+ae688d469e36fb5138d0@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=ae688d469e36fb5138d0
Fixes: 4e19d6b65f ("ext4: allow directory holes")
Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240702132349.2600605-3-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
[ Upstream commit 4811f7af6090e8f5a398fbdd766f903ef6c0d787 ]
Fix CVE: CVE-2024-42295
Syzbot reported that a buffer state inconsistency was detected in
nilfs_btnode_create_block(), triggering a kernel bug.
It is not appropriate to treat this inconsistency as a bug; it can occur
if the argument block address (the buffer index of the newly created
block) is a virtual block number and has been reallocated due to
corruption of the bitmap used to manage its allocation state.
So, modify nilfs_btnode_create_block() and its callers to treat it as a
possible filesystem error, rather than triggering a kernel bug.
Link: https://lkml.kernel.org/r/20240725052007.4562-1-konishi.ryusuke@gmail.com
Fixes: a60be987d4 ("nilfs2: B-tree node cache")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+89cc4f2324ed37988b60@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=89cc4f2324ed37988b60
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
[ Upstream commit 4ed886b187f47447ad559619c48c086f432d2b77 ]
Fix CVE: CVE-2024-42160
- It missed to check validation of fault attrs in parse_options(),
let's fix to add check condition in f2fs_build_fault_attr().
- Use f2fs_build_fault_attr() in __sbi_store() to clean up code.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
[ Upstream commit c462ecd659b5fce731f1d592285832fd6ad54053 ]
Fix CVE: CVE-2024-41077
Block size should be between 512 and PAGE_SIZE and be a power of 2. The current
check does not validate this, so update the check.
Without this patch, null_blk would Oops due to a null pointer deref when
loaded with bs=1536 [1].
Link: https://lore.kernel.org/all/87wmn8mocd.fsf@metaspace.dk/
Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240603192645.977968-1-nmi@metaspace.dk
[axboe: remove unnecessary braces and != 0 check]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
[ Upstream commit c462ecd659b5fce731f1d592285832fd6ad54053 ]
Fix CVE: CVE-2024-41077
For zoned=1 mode, the zone size must be a power of 2. Check this not
only when the zone size is specified during modprobe, but also when
creating a zoned null_blk device using configfs.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
commit 753039ef8b upstream
[tapd]
https://tapd.woa.com/tapd_fe/20422414/story/detail/1020422414119099349
Family 19h CPUs are Zen-based and still share most architectural
features with Family 17h CPUs, and therefore still need to call
init_amd_zn() e.g., to set the RECLAIM_DISTANCE override.
init_amd_zn() also sets X86_FEATURE_ZEN, which today is only used
in amd_set_core_ssb_state(), which isn't called on some late
model Family 17h CPUs, nor on any Family 19h CPUs:
X86_FEATURE_AMD_SSBD replaces X86_FEATURE_LS_CFG_SSBD on those
later model CPUs, where the SSBD mitigation is done via the
SPEC_CTRL MSR instead of the LS_CFG MSR.
Family 19h CPUs also don't have the erratum where the CPB feature
bit isn't set, but that code can stay unchanged and run safely
on Family 19h.
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200311191451.13221-1-kim.phillips@amd.com
Signed-off-by: Xin Hao <vernhao@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Hui Li <caelli@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
commit cf3f9a593dab87a032d2b6a6fb205e7f3de4f0a1 upstream.
When mm_update_owner_next() is racing with swapoff (try_to_unuse()) or
/proc or ptrace or page migration (get_task_mm()), it is impossible to
find an appropriate task_struct in the loop whose mm_struct is the same as
the target mm_struct.
If the above race condition is combined with the stress-ng-zombie and
stress-ng-dup tests, such a long loop can easily cause a Hard Lockup in
write_lock_irq() for tasklist_lock.
Recognize this situation in advance and exit early.
Link: https://lkml.kernel.org/r/20240620122123.3877432-1-alexjlzheng@tencent.com
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tycho Andersen <tandersen@netflix.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Commit <29e5375d7fcb5> ("add struct uncached_list->quarantine list")
[tapd]
https://tapd.woa.com/TS4Q/bugtrace/bugs/view/1020422414128561917
when del list entry in cycle create by list_for_each_entry
shuld use list_for_each_entry_safe or a deadlock may raise
fix <769ae85160> ("ipv4: add (struct uncached_list)->quarantine list")
Reviewed-by: Hui Li <caelli@tencent.com>
Signed-off-by: mengensun <mengensun@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Ampere added extra code for arm64 alignment fix,
debug and stacktrace was print after fix failure,
which may not suitable for production environment,
remove this.
Signed-off-by: caelli <caelli@tencent.com>
Reviewed-by: mengensun <mengensun@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
commit 28b4b05963 upstream.
The following error occurred during the fsstress test:
XFS: Assertion failed: VFS_I(ip)->i_nlink >= 2, file: fs/xfs/xfs_inode.c, line: 2452
The problem was that inode race condition causes incorrect i_nlink to be
written to disk, and then it is read into memory. Consider the following
call graph, inodes that are marked as both XFS_IFLUSHING and
XFS_IRECLAIMABLE, i_nlink will be reset to 1 and then restored to original
value in xfs_reinit_inode(). Therefore, the i_nlink of directory on disk
may be set to 1.
xfsaild
xfs_inode_item_push
xfs_iflush_cluster
xfs_iflush
xfs_inode_to_disk
xfs_iget
xfs_iget_cache_hit
xfs_iget_recycle
xfs_reinit_inode
inode_init_always
xfs_reinit_inode() needs to hold the ILOCK_EXCL as it is changing internal
inode state and can race with other RCU protected inode lookups. On the
read side, xfs_iflush_cluster() grabs the ILOCK_SHARED while under rcu +
ip->i_flags_lock, and so xfs_iflush/xfs_inode_to_disk() are protected from
racing inode updates (during transactions) by that lock.
Fixes: ff7bebeb91 ("xfs: refactor the inode recycling code") # goes further back than this
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Hui Li <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
commit 7f63cf2d9b upstream.
Android 14 and later default to MGLRU [1] and field telemetry showed
occasional long tail latency (>100ms) in the reclaim path.
Tracing revealed priority inversion in the reclaim path. In
try_to_inc_max_seq(), when high priority tasks were blocked on
wait_event_killable(), the preemption of the low priority task to call
wake_up_all() caused those high priority tasks to wait longer than
necessary. In general, this problem is not different from others of its
kind, e.g., one caused by mutex_lock(). However, it is specific to MGLRU
because it introduced the new wait queue lruvec->mm_state.wait.
The purpose of this new wait queue is to avoid the thundering herd
problem. If many direct reclaimers rush into try_to_inc_max_seq(), only
one can succeed, i.e., the one to wake up the rest, and the rest who
failed might cause premature OOM kills if they do not wait. So far there
is no evidence supporting this scenario, based on how often the wait has
been hit. And this begs the question how useful the wait queue is in
practice.
Based on Minchan's recommendation, which is in line with his commit
6d4675e601 ("mm: don't be stuck to rmap lock on reclaim path") and the
rest of the MGLRU code which also uses trylock when possible, remove the
wait queue.
[1] https://android-review.googlesource.com/q/I7ed7fbfd6ef9ce10053347528125dd98c39e50bf
Link: https://lkml.kernel.org/r/20230413214326.2147568-1-kaleshsingh@google.com
Fixes: bd74fdaea1 ("mm: multi-gen LRU: support page table walks")
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Suggested-by: Minchan Kim <minchan@kernel.org>
Reported-by: Wei Wang <wvw@google.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Cc: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
In sysctl_restrict_open(), single_open() is used, so we should use the
corresponding single_release(), instead of seq_release(), otherwise it
will cause a memory leak.
Fixes: 6ac873e2f630 ("sysctl: restrict the writing for key parameters")
Signed-off-by: Juntong Deng <juntongdeng@tencent.com>
Signed-off-by: yilingjin <yilingjin@tencent.com>
Reviewed-by: Hui Li <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
add support for the Phytium QuadSPI controller driver
verifyed on Phytium D2000
Reviewed-by: Hongbo Mao <maohongbo@phytium.com.cn>
(cherry picked from commit bce2cf32f3)
Signed-off-by: Alex Shi <alexsshi@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Since current Phytium Px210 xHCI host controller does not support
USB Gen2(10Gbps) well, add the XHCI_SLOWDOWN_QUIRK quirk and
modify the limited speed to 5Gbps.
Signed-off-by: Feng Jun <fengjun@phytium.com.cn>
Signed-off-by: Chen Baozi <chenbaozi@phytium.com.cn>
Conflicts:
drivers/usb/host/xhci.h
(cherry picked from commit 50f3eecbba)
Signed-off-by: Alex Shi <alexsshi@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Add a driver for fan tachometer and capture counter of Phytium SoCs.
Signed-off-by: wangzhimin <wangzhimin1179@phytium.com.cn>
(cherry picked from commit cf09b9c0f0)
Signed-off-by: Alex Shi <alexsshi@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Add an standalone irqchip driver to handle Phytium PCI legacy interrupt.
When processing legacy INTx interrupts on some Phytium SoCs, the interrupt
status registers have be cleared by software explicitly. We introduce this
standalone irqchip which sits between the PCI legacy interrupt and the GIC,
applying hierarchical irqdomain to integrate the ack in the existing INTx
processing flow.
Signed-off-by: wangzhimin <wangzhimin1179@phytium.com.cn>
(cherry picked from commit aa66828d35)
Signed-off-by: Alex Shi <alexsshi@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
When mapping a LPI, the ITS driver picks the first possible
affinity, which is in most cases CPU0, assuming that if
that's not suitable, someone will come and set the affinity
to something more interesting.
It apparently isn't the case, and people complain of poor
performance when many interrupts are glued to the same CPU.
So let's place the interrupts by finding the "least loaded"
CPU (that is, the one that has the fewer LPIs mapped to it).
So called 'managed' interrupts are an interesting case where
the affinity is actually dictated by the kernel itself, and
we should honor this.
Reported-by: John Garry <john.garry@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/1575642904-58295-1-git-send-email-john.garry@huawei.com
Link: https://lore.kernel.org/r/20200515165752.121296-3-maz@kernel.org
(cherry picked from commit 840775cad1)
Signed-off-by: Alex Shi <alexsshi@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
In order to improve the distribution of LPIs among CPUs, let start by
tracking the number of LPIs assigned to CPUs, both for managed and
non-managed interrupts (as separate counters).
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/20200515165752.121296-2-maz@kernel.org
(cherry picked from commit aca60b181c)
Signed-off-by: Alex Shi <alexsshi@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
This patch adds the Phytium USBHS DRD controller support.
Signed-off-by: wangzhimin <wangzhimin1179@phytium.com.cn>
(cherry picked from commit 8e4291c455)
Signed-off-by: Alex Shi <alexsshi@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Add XHCI_RESET_ON_RESUME quirk for Phytium Pe220x
Phytium Pe220x xHCI host controller does not have suspend/resume
support. Therefore, use of the XHCI_RESET_ON_RESUME quirk is
mandatory in order to avoid failures after resume.
Signed-off-by: wangzhimin <wangzhimin1179@phytium.com.cn>
(cherry picked from commit 33d50ca483)
Signed-off-by: Alex Shi <alexsshi@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Add initial device tree for Phytium Pe220x SoCs. Phytium Pe220x
series has three specs (Pe2201/Pe2202/Pe2204), distinguished by
the number of CPU core. Besides CPU cores, on-chip peripherals
also vary. Thus, we split them into three separate DTBs.
Signed-off-by: wangzhimin <wangzhimin1179@phytium.com.cn>
(cherry picked from commit 1d5bb1d501)
Signed-off-by: Alex Shi <alexsshi@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
SPI platform driver for Phytium desktop CPUs, such as FT-2000/4 and D2000.
Reviewed-by: Hongbo Mao <maohongbo@phytium.com.cn>
(cherry picked from commit 555ae1361a)
Signed-off-by: Alex Shi <alexsshi@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
GPIO dirver fix patch. Support Phytiumm Desktop and Embedded CPUs, such as D2000 and E2000.
Reviewed-by: Hongbo Mao <maohongbo@phytium.com.cn>
(cherry picked from commit c2a5e2beae)
Signed-off-by: Alex Shi <alexsshi@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
SoC's on-chip RGMII.
Signed-off-by: wangzhimin <wangzhimin1179@phytium.com.cn>
(cherry picked from commit 4611897533)
Signed-off-by: Alex Shi <alexsshi@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
RTC driver function fix patch Support Phytium desktop and embedded processors, such as D2000 and E2000
Reviewed-by: Wang Nan <wangnan1505@phytium.com.cn>
(cherry picked from commit 00f079b210)
Signed-off-by: Alex Shi <alexsshi@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
serial driver function fix patch. Support Phytium desktop and embedded processors, such as D2000.
Reviewed-by:JianChao Sheng <shengjianchao@phytium.com.cn>
(cherry picked from commit 92e0f87830)
Signed-off-by: Alex Shi <alexsshi@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Mailbox driver function patch. Support Phytium desktop and embedded processors, such as D2000 and E2000.
Reviewed-by: Hongbo Mao <maohongbo@phytium.com.cn>
(cherry picked from commit f6e90a3999)
Signed-off-by: Alex Shi <alexsshi@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
I2C driver function fix patch. Support Phytium desktop and embedded processors, such as D2000 and E2000.
Conflicts:
drivers/i2c/busses/i2c-designware-platdrv.c
Reviewed-by: Hongbo Mao <maohongbo@phytium.com.cn>
(cherry picked from commit 5ec3d87d62)
Signed-off-by: Alex Shi <alexsshi@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
adds a simple device driver to expose the BT interface
on Phytium SoC as a character device. Such SOCs are commonly
used as BMCs (BaseBoard Management Controllers) and this driver
implements the BMC side of the BT interface.
(cherry picked from commit ccb23948d0)
Signed-off-by: Alex Shi <alexsshi@tencent.com>
Conflicts:
drivers/char/ipmi/Kconfig
drivers/char/ipmi/Makefile
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
fix the problem for setting bitrate on phytium D2000 Soc
Signed-off-by: Chen Siyu <chensiyu1321@phytium.com.cn>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
commit f342de4e2f33e0e39165d8639387aa6c19dff660 upstream.
This reverts commit e0abdadcc6.
core.c:nf_hook_slow assumes that the upper 16 bits of NF_DROP
verdicts contain a valid errno, i.e. -EPERM, -EHOSTUNREACH or similar,
or 0.
Due to the reverted commit, its possible to provide a positive
value, e.g. NF_ACCEPT (1), which results in use-after-free.
Its not clear to me why this commit was made.
NF_QUEUE is not used by nftables; "queue" rules in nftables
will result in use of "nft_queue" expression.
If we later need to allow specifiying errno values from userspace
(do not know why), this has to call NF_DROP_GETERR and check that
"err <= 0" holds true.
Fixes: e0abdadcc6 ("netfilter: nf_tables: accept QUEUE/DROP verdict parameters")
Cc: stable@vger.kernel.org
Reported-by: Notselwyn <notselwyn@pwning.tech>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Fuhai Wang <fuhaiwang@tencent.com>
Signed-off-by: caelli <caelli@tencent.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
The commit 91dc13667e have change change
the struct of ethtool_ops, which will cause compile error.
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
add support for Wangxun 10GbE driver, source files and
functions are the same as wangxun out of box drivevr
release version txgbe-1.3.5.1.
Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
add support for wangxun 1GbE driver, source files and functions
are the same as wangxun oob ngbe-1.2.5.3.
Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Sync CONFIG_INFINIBAND_* from linux-5.4/lts/5.4.119-20.0009, ZhongXing NIC RDMA
needing infiniband driver.
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Update config with the commands as below:
make tencentconfig
make savedefconfig
mv defconfig arch/$ARCH/configs/tencent.config
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
[ Upstream commit 13400ac8fb80c57c2bfb12ebd35ee121ce9b4d21 ]
trie_get_next_key() allocates a node stack with size trie->max_prefixlen,
while it writes (trie->max_prefixlen + 1) nodes to the stack when it has
full paths from the root to leaves. For example, consider a trie with
max_prefixlen is 8, and the nodes with key 0x00/0, 0x00/1, 0x00/2, ...
0x00/8 inserted. Subsequent calls to trie_get_next_key with _key with
.prefixlen = 8 make 9 nodes be written on the node stack with size 8.
Fixes: b471f2f1de ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map")
Signed-off-by: Byeonguk Jeong <jungbu2855@gmail.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org>
Tested-by: Hou Tao <houtao1@huawei.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/Zxx384ZfdlFYnz6J@localhost.localdomain
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
[ Upstream commit 1d244784be6b01162b732a5a7d637dfc024c3203 ]
Percpu map is often used, but the map value size limit often ignored,
like issue: https://github.com/iovisor/bcc/issues/2519. Actually,
percpu map value size is bound by PCPU_MIN_UNIT_SIZE, so we
can check the value size whether it exceeds PCPU_MIN_UNIT_SIZE first,
like percpu map of local_storage. Maybe the error message seems clearer
compared with "cannot allocate memory".
Signed-off-by: Jinke Han <jinkehan@didiglobal.com>
Signed-off-by: Tao Chen <chen.dylane@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240910144111.1464912-2-chen.dylane@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
[ Upstream commit 9b75dbeb36fcd9fc7ed51d370310d0518a387769 ]
When looking up an element in LPM trie, the condition 'matchlen ==
trie->max_prefixlen' will never return true, if key->prefixlen is larger
than trie->max_prefixlen. Consequently all elements in the LPM trie will
be visited and no element is returned in the end.
To resolve this, check key->prefixlen first before walking the LPM trie.
Fixes: b95a5c4db0 ("bpf: add a longest prefix match trie map implementation")
Signed-off-by: Florian Lehner <dev@der-flo.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231105085801.3742-1-dev@der-flo.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
commit 291d044fd51f8484066300ee42afecf8c8db7b3a upstream.
BPF_END and BPF_NEG has a different specification for the source bit in
the opcode compared to other ALU/ALU64 instructions, and is either
reserved or use to specify the byte swap endianness. In both cases the
source bit does not encode source operand location, and src_reg is a
reserved field.
backtrack_insn() currently does not differentiate BPF_END and BPF_NEG
from other ALU/ALU64 instructions, which leads to r0 being incorrectly
marked as precise when processing BPF_ALU | BPF_TO_BE | BPF_END
instructions. This commit teaches backtrack_insn() to correctly mark
precision for such case.
While precise tracking of BPF_NEG and other BPF_END instructions are
correct and does not need fixing, this commit opt to process all BPF_NEG
and BPF_END instructions within the same if-clause to better align with
current convention used in the verifier (e.g. check_alu_op).
Fixes: b5dc0163d8 ("bpf: precise scalar_value tracking")
Cc: stable@vger.kernel.org
Reported-by: Mohamed Mahmoud <mmahmoud@redhat.com>
Closes: https://lore.kernel.org/r/87jzrrwptf.fsf@toke.dk
Tested-by: Toke Høiland-Jørgensen <toke@redhat.com>
Tested-by: Tao Lyu <tao.lyu@epfl.ch>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231102053913.12004-2-shung-hsi.yu@suse.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
[ Upstream commit a34a9f1a19 ]
Sysbot discovered that the queue and stack maps can deadlock if they are
being used from a BPF program that can be called from NMI context (such as
one that is attached to a perf HW counter event). To fix this, add an
in_nmi() check and use raw_spin_trylock() in NMI context, erroring out if
grabbing the lock fails.
Fixes: f1a2e44a3a ("bpf: add queue and stack maps")
Reported-by: Hsin-Wei Hung <hsinweih@uci.edu>
Tested-by: Hsin-Wei Hung <hsinweih@uci.edu>
Co-developed-by: Hsin-Wei Hung <hsinweih@uci.edu>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20230911132815.717240-1-toke@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
[ Upstream commit ee9fd0ac30 ]
KCSAN reported a data-race when accessing node->ref.
Although node->ref does not have to be accurate,
take this chance to use a more common READ_ONCE() and WRITE_ONCE()
pattern instead of data_race().
There is an existing bpf_lru_node_is_ref() and bpf_lru_node_set_ref().
This patch also adds bpf_lru_node_clear_ref() to do the
WRITE_ONCE(node->ref, 0) also.
==================================================================
BUG: KCSAN: data-race in __bpf_lru_list_rotate / __htab_lru_percpu_map_update_elem
write to 0xffff888137038deb of 1 bytes by task 11240 on cpu 1:
__bpf_lru_node_move kernel/bpf/bpf_lru_list.c:113 [inline]
__bpf_lru_list_rotate_active kernel/bpf/bpf_lru_list.c:149 [inline]
__bpf_lru_list_rotate+0x1bf/0x750 kernel/bpf/bpf_lru_list.c:240
bpf_lru_list_pop_free_to_local kernel/bpf/bpf_lru_list.c:329 [inline]
bpf_common_lru_pop_free kernel/bpf/bpf_lru_list.c:447 [inline]
bpf_lru_pop_free+0x638/0xe20 kernel/bpf/bpf_lru_list.c:499
prealloc_lru_pop kernel/bpf/hashtab.c:290 [inline]
__htab_lru_percpu_map_update_elem+0xe7/0x820 kernel/bpf/hashtab.c:1316
bpf_percpu_hash_update+0x5e/0x90 kernel/bpf/hashtab.c:2313
bpf_map_update_value+0x2a9/0x370 kernel/bpf/syscall.c:200
generic_map_update_batch+0x3ae/0x4f0 kernel/bpf/syscall.c:1687
bpf_map_do_batch+0x2d9/0x3d0 kernel/bpf/syscall.c:4534
__sys_bpf+0x338/0x810
__do_sys_bpf kernel/bpf/syscall.c:5096 [inline]
__se_sys_bpf kernel/bpf/syscall.c:5094 [inline]
__x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5094
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
read to 0xffff888137038deb of 1 bytes by task 11241 on cpu 0:
bpf_lru_node_set_ref kernel/bpf/bpf_lru_list.h:70 [inline]
__htab_lru_percpu_map_update_elem+0x2f1/0x820 kernel/bpf/hashtab.c:1332
bpf_percpu_hash_update+0x5e/0x90 kernel/bpf/hashtab.c:2313
bpf_map_update_value+0x2a9/0x370 kernel/bpf/syscall.c:200
generic_map_update_batch+0x3ae/0x4f0 kernel/bpf/syscall.c:1687
bpf_map_do_batch+0x2d9/0x3d0 kernel/bpf/syscall.c:4534
__sys_bpf+0x338/0x810
__do_sys_bpf kernel/bpf/syscall.c:5096 [inline]
__se_sys_bpf kernel/bpf/syscall.c:5094 [inline]
__x64_sys_bpf+0x43/0x50 kernel/bpf/syscall.c:5094
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
value changed: 0x01 -> 0x00
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 11241 Comm: syz-executor.3 Not tainted 6.3.0-rc7-syzkaller-00136-g6a66fdd29ea1 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/30/2023
==================================================================
Reported-by: syzbot+ebe648a84e8784763f82@syzkaller.appspotmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20230511043748.1384166-1-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
commit 0613d8ca9a upstream.
A narrow load from a 64-bit context field results in a 64-bit load
followed potentially by a 64-bit right-shift and then a bitwise AND
operation to extract the relevant data.
In the case of a 32-bit access, an immediate mask of 0xffffffff is used
to construct a 64-bit BPP_AND operation which then sign-extends the mask
value and effectively acts as a glorified no-op. For example:
0: 61 10 00 00 00 00 00 00 r0 = *(u32 *)(r1 + 0)
results in the following code generation for a 64-bit field:
ldr x7, [x7] // 64-bit load
mov x10, #0xffffffffffffffff
and x7, x7, x10
Fix the mask generation so that narrow loads always perform a 32-bit AND
operation:
ldr x7, [x7] // 64-bit load
mov w10, #0xffffffff
and w7, w7, w10
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Krzesimir Nowak <krzesimir@kinvolk.io>
Cc: Andrey Ignatov <rdna@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Fixes: 31fd85816d ("bpf: permits narrower load from bpf program context fields")
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20230518102528.1341-1-will@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>