Commit Graph

801122 Commits

Author SHA1 Message Date
Duanqiang Wen 63c75247db anolis: net: txgbe: fix mailbox error when echo vf
ANBZ: #8072

after make modules_install of the vf driver,
echoing vf on different ports in succession
causes problems with the mailbox lock.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
Duanqiang Wen 31a26be1ba anolis: net: txgbe: fix ethtool set rss indir table
ANBZ: #8072

ethtool -X ethx equal/weight cannot update
the ethx rss indir table.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
Duanqiang Wen 058e67ee39 anolis: net: txgbe: show max_combined wrong when enable sriov
ANBZ: #8072

when sriov is enabled, ethtool -l ethx always reports
a pre-set max_combined of 1, but in reality
max_combined depends on num_vfs.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
Duanqiang Wen 4f08963061 anolis: net: txgbevf: support for vf rss
ANBZ: #8072

add support for virtual function rss features.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
Duanqiang Wen 2dda2f6f04 anolis: net: txgbevf: fix vf queues maximum
ANBZ: #8072

fix the vf supporting a maximum of only 2 queues:
get the maximum number of queues from pf msg[TXGBE_VF_RX_QUEUES]
and msg[TXGBE_VF_TX_QUEUES], and set the maximum number of queues to
4.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
Duanqiang Wen bbaa246502 anolis: net: txgbe: support to add ether type filter by ethtool
ANBZ: #8072

support to add ether type filter by ethtool and fix vf rss function.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
Duanqiang Wen 05464c990e anolis: net: txgbe: add led blink support for oem id 0x1ff9
ANBZ: #8072

ethtool -p ethx, add support for oem id 0x1ff9.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
Duanqiang Wen 65b7bbfdcd anolis: net: txgbe: fix ethtool -t loopback test failed
ANBZ: #8072

with version 5042000f firmware, ethtool -t ethx fails
because diag_test clears the driver load bit and resets the lan;
the firmware then reconfigures the pcs, which causes the loopback
test to fail.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
Duanqiang Wen 48e24cb036 anolis: net: txgbe: support copper modules and DAC
ANBZ: #8072

add support for detecting copper module link status
and DAC cable.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
Duanqiang Wen e8e8e274be anolis: net: txgbe: fix cannot link to 10G
ANBZ: #8072

fix ethtool -s ethx autoneg on cannot link to 10G,
step:
1.ethtool -s eth0 speed 1000 duplex full autoneg on
2.unplug eth0 fiber
3.ethtool -s eth0 autoneg on
4.plug fiber
link speed is 1G.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
Duanqiang Wen 1f2ec94aa0 anolis: net: txgbe: return error code for unsupported parameters
ANBZ: #8072

When running ethtool -C, return -EINVAL if none of the supported
coalesce parameters is changed.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
Duanqiang Wen ccb245692a anolis: net: txgbe: support for 802.1ad
ANBZ: #8072

add support for 802.1ad vlan, offload setting
follows 802.1q vlan setting.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
Duanqiang Wen 6867fd682f anolis: net: txgbe: add support to show fw version on vf
ANBZ: #8072

add support for showing the fw version on the vf driver.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
Duanqiang Wen f435e65a3b anolis: net: txgbe:set i2c_speed to standard mode
ANBZ: #8072

the real i2c speed is higher than standard mode; set the i2c
speed to standard mode.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
Duanqiang Wen 1c68adced0 anolis: net: txgbe: sriov mode can't enable lro
ANBZ: #8072

after enabling sriov mode, ethtool -k to set ntuple on
will return requested on, and setting is not effective.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
Duanqiang Wen 2410a678b3 anolis: txgbevf: fix make allyesconfig build failed
ANBZ: #8072

make allyesconfig fails to build because of multiple
definitions; change the txgbevf module function names.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
Duanqiang Wen 26fab43533 anolis: net: txgbe: support for pf change ntuple setting
ANBZ: #8072

ethtool -k ethx shows the ntuple setting as fixed;
add support for changing the ntuple setting.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
Duanqiang Wen 544be4200a anolis: net: txgbe: fix set vf ntuple rule cannot work
ANBZ: #8072

pf ethtool operation for ntuple setting, is not
supported for vf, add support for ntuple rules to
flow packets to vf queue.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
Duanqiang Wen 93c4c80423 anolis: net: wangxun: change driver version
ANBZ: #8072

append driver version with anolis,
it is helpful to distinguish inbox driver
and out of tree driver.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
Duanqiang Wen b3718d5c2e anolis: net: txgbe: fix qinq or double vlan tso is not work
ANBZ: #8072

add ndo_features_check for double vlan or qinq to fix
the tso bug.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
Duanqiang Wen 2486231a5a anolis: net: txgbe: fix different vlanid can send to virtual function
ANBZ: #8072

when the pf acked a vf vlan setting, it didn't check the vid in active_vlan.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
Duanqiang Wen c806ad788a anolis: config: default to build txgbevf for module
ANBZ: #8072

default to build txgbevf for module, only in
x86 and arm64 arch.

Signed-off-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
DuanqiangWen ddb0a323cf anolis: net: txgbevf: add support for power management
ANBZ: #8072

add support for power management interface,
for suspending and resuming nic.

Signed-off-by: DuanqiangWen <duanqiangwen@net-swift.com>
Reviewed-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
DuanqiangWen 34cc16beef anolis: net: anolis: add support for ethtool
ANBZ: #8072

add support for ethtool, so that ethtool
can get some virtual function information.

Signed-off-by: DuanqiangWen <duanqiangwen@net-swift.com>
Reviewed-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
DuanqiangWen 19411b102a anolis: net: txgbevf: add support for tx/rx traffic
ANBZ: #8072

add xmit and receive code for virtual function.

Signed-off-by: DuanqiangWen <duanqiangwen@net-swift.com>
Reviewed-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
DuanqiangWen 5a6f19204f anolis: net: txgbevf: add hardware initialization
ANBZ: #8072

initialize hardware, including vf mac layer,
mailbox interface.

Signed-off-by: DuanqiangWen <duanqiangwen@net-swift.com>
Reviewed-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
DuanqiangWen 59cb85c93e anolis: net: txgbevf: Add build support for txgbevf
ANBZ: #8072

Add doc build infrastructure for txgbevf driver.
Initialize PCI memory space for WangXun 10 Gigabit
virtual function Ethernet devices.

Signed-off-by: DuanqiangWen <duanqiangwen@net-swift.com>
Reviewed-by: Duanqiang Wen <duanqiangwen@net-swift.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3121
2024-05-27 07:30:09 +00:00
Ricardo B. Marliere 8d56d12294 media: pvrusb2: fix use after free on context disconnection
ANBZ: #8555

commit ded85b0c0e upstream.

Upon module load, a kthread is created targeting the
pvr2_context_thread_func function, which may call pvr2_context_destroy
and thus call kfree() on the context object. However, that might happen
before the usb hub_event handler is able to notify the driver. This
patch adds a sanity check before the invalid read reported by syzbot,
within the context disconnection call stack.

Reported-and-tested-by: syzbot+621409285c4156a009b3@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000a02a4205fff8eb92@google.com/

Fixes: e5be15c638 ("V4L/DVB (7711): pvrusb2: Fix race on module unload")
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Acked-by: Mike Isely <isely@pobox.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>

Fixes: CVE-2023-52445
Signed-off-by: Xiao Long <xiaolong@openanolis.org>
Signed-off-by: Qinyun Tan <qinyuntan@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3021
2024-05-22 12:03:38 +00:00
Joseph Qi 266e564f30 anolis: check cgroup v1 for memcg_blkcg_tree operations
ANBZ: #8973

Currently parameter 'cgwb_v1' can be setup unconditionally.

Take the following abnormal case into consideration:
System administrator configures both 'cgwb_v1' and
'systemd.unified_cgroup_hierarchy=1' in command line by mistake, so we
use cgroup v2 after boot in fact. Though we'll check if current kernel
is under cgroup v2 in inode_cgwb_enabled(), we still allocate, insert
and delete links for memcg_blkcg_tree since we only check parameter
'cgwb_v1'.

This seems to cause no actual harm, but it is entirely unnecessary and wasteful. So
restrict these operations only under cgroup v1. Since bdi initialization
is before enabling cgroup subsys, so we'll still create debug file
bdi_wb_link but without any links in above abnormal case.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3147
2024-05-09 15:54:37 +08:00
Mao Wenan a665acc125 af_packet: set defaule value for tmo
ANBZ: #8733

[ Upstream commit b43d1f9f70 ]

There is softlockup when using TPACKET_V3:
...
NMI watchdog: BUG: soft lockup - CPU#2 stuck for 60010ms!
(__irq_svc) from [<c0558a0c>] (_raw_spin_unlock_irqrestore+0x44/0x54)
(_raw_spin_unlock_irqrestore) from [<c027b7e8>] (mod_timer+0x210/0x25c)
(mod_timer) from [<c0549c30>]
(prb_retire_rx_blk_timer_expired+0x68/0x11c)
(prb_retire_rx_blk_timer_expired) from [<c027a7ac>]
(call_timer_fn+0x90/0x17c)
(call_timer_fn) from [<c027ab6c>] (run_timer_softirq+0x2d4/0x2fc)
(run_timer_softirq) from [<c021eaf4>] (__do_softirq+0x218/0x318)
(__do_softirq) from [<c021eea0>] (irq_exit+0x88/0xac)
(irq_exit) from [<c0240130>] (msa_irq_exit+0x11c/0x1d4)
(msa_irq_exit) from [<c0209cf0>] (handle_IPI+0x650/0x7f4)
(handle_IPI) from [<c02015bc>] (gic_handle_irq+0x108/0x118)
(gic_handle_irq) from [<c0558ee4>] (__irq_usr+0x44/0x5c)
...

If __ethtool_get_link_ksettings() fails in
prb_calc_retire_blk_tmo(), msec and tmo will be zero, so tov_in_jiffies
is zero and retire_blk_timer is re-armed with
mod_timer(&pkc->retire_blk_timer, jiffies + 0),
which drives the softirq cpu usage to 100%.
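
The guard can be pictured with a minimal user-space sketch. It is not the kernel code: the helper name retire_blk_tmo_ms(), its parameters and the 8 ms fallback are assumptions made for illustration. It only shows the idea of never handing a zero timeout to the timer.

```c
#include <stdint.h>

/* Hypothetical fallback, in milliseconds, used when no timeout can be derived. */
#define DEFAULT_RETIRE_TMO_MS 8u

/*
 * Simplified model: estimate how long a block of blk_size_bytes takes to
 * fill at link_speed_mbps. A link_speed_mbps of 0 models the case where the
 * link settings could not be read.
 */
static unsigned int retire_blk_tmo_ms(uint64_t blk_size_bytes,
                                      unsigned int link_speed_mbps)
{
    uint64_t bits, tmo;

    if (link_speed_mbps == 0)
        return DEFAULT_RETIRE_TMO_MS;   /* never arm the timer with 0 */

    bits = blk_size_bytes * 8;
    /* 1 Mbit/s is 1000 bits per millisecond; round up to at least 1 ms. */
    tmo = bits / ((uint64_t)link_speed_mbps * 1000u);
    return tmo ? (unsigned int)tmo : 1u;
}
```

With this model, an unknown link speed yields the fallback instead of a timer that expires immediately and re-arms itself forever.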

Fixes: f6fb8f100b ("af-packet: TPACKET_V3 flexible buffer implementation.")
Tested-by: Xiao Jiangfeng <xiaojiangfeng@huawei.com>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tom Yang <yangqixiao@inspur.com>
Reviewed-by: D. Wythe <alibuda@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3036
2024-05-06 07:57:11 +00:00
Cruz Zhao 9a4dfd6c08 anolis: sched/fair: fix underclass unscheduled after ID_ABSOLUTE_EXPEL turned off
ANBZ: #8821

In function sync_min_vruntime(), expel_start and expel_spread will not
be cleared if CONFIG_SCHED_SMT is off. As a result, the priority
of underclass will be much lower than others once ID_ABSOLUTE_EXPEL is
turned off, because min_vruntime << min_under_vruntime + expel_spread
after sync_min_vruntime().

To fix this problem, we clear expel_start and expel_spread in
sync_min_vruntime() regardless of whether CONFIG_SCHED_SMT is on.

Fixes: 139aefab8eaa("anolis: sched/fair: introduce sched_feat ID_ABSOLUTE_EXPEL")
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3104
2024-04-29 02:31:42 +00:00
Philo Lu 18288b4ee8 anolis: Revert "anolis: virtio-net: open napi for tx"
ANBZ: #8910

This reverts commit 8927cac904.

A few regressions were found in benchmarks, so we decided to disable napi_tx
by default, consistent with older versions, to keep
the performance stable.

Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3106
2024-04-28 21:08:34 +08:00
Cruz Zhao c805900600 anolis: sched/fair: fix invalid ID_ABSOLUTE_EXPEL without CONFIG_SCHED_SMT
ANBZ: #8821

If CONFIG_SCHED_SMT is turned off, ID_ABSOLUTE_EXPEL will be invalid,
because update_expel_start() is an empty function, so there's no
chance to adjust vruntime to make underclass lag, and underclass tasks
will get a chance to run, unexpectedly.

To fix this problem, we just change the logic of id_vruntime_before(),
letting the vruntime of highclass and normal sched_entity always be
before underclass sched_entity.
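
A rough sketch of such an ordering rule, assuming a simplified entity with only a vruntime and an underclass flag (struct fake_se and vruntime_before() are hypothetical names, not the scheduler's real types):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a sched_entity. */
struct fake_se {
    uint64_t vruntime;
    bool underclass;
};

/* Returns true if a should be ordered before b. */
static bool vruntime_before(const struct fake_se *a, const struct fake_se *b)
{
    /* Highclass/normal entities always sort before underclass ones. */
    if (a->underclass != b->underclass)
        return !a->underclass;

    /* Same class: plain wrap-safe vruntime comparison. */
    return (int64_t)(a->vruntime - b->vruntime) < 0;
}
```

The class check comes first, so a large vruntime on a normal entity can never let an underclass entity overtake it.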

Fixes: 139aefab8eaa("anolis: sched/fair: introduce sched_feat ID_ABSOLUTE_EXPEL")
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3085
2024-04-24 02:08:55 +00:00
Johannes Weiner d62bc059c9 mm: fix false-positive OVERCOMMIT_GUESS failures
ANBZ: #8860

commit 8c7829b04c upstream

With the default overcommit==guess we occasionally run into mmap
rejections despite plenty of memory that would get dropped under
pressure but just isn't accounted reclaimable. One example of this is
dying cgroups pinned by some page cache. A previous case was auxiliary
path name memory associated with dentries; we have since annotated
those allocations to avoid overcommit failures (see d79f7aa496 ("mm:
treat indirectly reclaimable memory as free in overcommit logic")).

But trying to classify all allocated memory reliably as reclaimable
and unreclaimable is a bit of a fool's errand. There could be a myriad
of dependencies that constantly change with kernel versions.

It becomes even more questionable of an effort when considering how
this estimate of available memory is used: it's not compared to the
system-wide allocated virtual memory in any way. It's not even
compared to the allocating process's address space. It's compared to
the single allocation request at hand!

So we have an elaborate left-hand side of the equation that tries to
assess the exact breathing room the system has available down to a
page - and then compare it to an isolated allocation request with no
additional context. We could fail an allocation of N bytes, but for
two allocations of N/2 bytes we'd do this elaborate dance twice in a
row and then still let N bytes of virtual memory through. This doesn't
make a whole lot of sense.

Let's take a step back and look at the actual goal of the
heuristic. From the documentation:

   Heuristic overcommit handling. Obvious overcommits of address
   space are refused. Used for a typical system. It ensures a
   seriously wild allocation fails while allowing overcommit to
   reduce swap usage.  root is allowed to allocate slightly more
   memory in this mode. This is the default.

If all we want to do is catch clearly bogus allocation requests
irrespective of the general virtual memory situation, the physical
memory counter-part doesn't need to be that complicated, either.

When in GUESS mode, catch wild allocations by comparing their request
size to total amount of ram and swap in the system.
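
As a minimal illustration of that comparison (the function and parameter names are made up for this sketch, and all sizes are in pages):

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of the GUESS-mode check; all names are illustrative. */
static bool guess_mode_allows(uint64_t request_pages,
                              uint64_t total_ram_pages,
                              uint64_t total_swap_pages)
{
    /* Refuse only requests larger than all RAM plus all swap. */
    return request_pages <= total_ram_pages + total_swap_pages;
}
```

A request is judged only against the machine's total capacity, so two half-sized requests and one full-sized request are treated consistently.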

Link: http://lkml.kernel.org/r/20190412191418.26333-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kaihao Bai <carlo.bai@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3088
2024-04-23 17:33:25 +08:00
Guixin Liu 35d297d6b4 anolis: net: directly copy page instead of map page
ANBZ: #8749

If __skb_datagram_iter's cb parameter is simple_copy_to_iter, we don't need
to map the page first; just use copy_page_to_iter to copy the page directly.
Also remove simple_copy_to_iter().

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: D. Wythe <alibuda@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3050
2024-04-18 05:55:54 +00:00
Dust Li a5de83dd72 anolis: mlx5: fix double rcu_read_lock() in mlx5_eq_cq_get()
ANBZ: #8774

when backporting upstream commit
1fbf1252df0e42("mlx5: use RCU lock in mlx5_eq_cq_get()"),
we mistakenly called rcu_read_lock() twice, without an unlock.

Fixes: 1335c7384274("mlx5: use RCU lock in mlx5_eq_cq_get()")
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: Qinyun Tan <qinyuntan@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3061
2024-04-17 10:14:14 +00:00
Cruz Zhao 83842b76dd anolis: sched/fair: optimize ID_LOAD_BALANCE to rescue underclass
ANBZ: #8758

ID_LOAD_BALANCE tends to migrate highclass and normal tasks first to
prevent cpu competition among them. As a result, underclass
tasks are very likely to lose the migration opportunity, even
when they are expelled.

To optimize ID_LOAD_BALANCE, we will redo load balance if there is
still imbalance, and in the second loop we will allow migrating
underclass tasks.

Fixes: 9fa7c9d6eb14("anolis: sched/fair: introduce sched_feat ID_LOAD_BALANCE")
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3054
2024-04-17 07:16:33 +00:00
Liguang Zhang 56cf8b6f8b PCI: pciehp: Clear cmd_busy bit in polling mode
ANBZ: #8731

commit 92912b1751 upstream.

Writes to a Downstream Port's Slot Control register are PCIe hotplug
"commands."  If the Port supports Command Completed events, software must
wait for a command to complete before writing to Slot Control again.

pcie_do_write_cmd() sets ctrl->cmd_busy when it writes to Slot Control.  If
software notification is enabled, i.e., PCI_EXP_SLTCTL_HPIE and
PCI_EXP_SLTCTL_CCIE are set, ctrl->cmd_busy is cleared by pciehp_isr().

But when software notification is disabled, as it is when pcie_init()
powers off an empty slot, pcie_wait_cmd() uses pcie_poll_cmd() to poll for
command completion, and it neglects to clear ctrl->cmd_busy, which leads to
spurious timeouts:

  pcieport 0000:00:03.0: pciehp: Timeout on hotplug command 0x01c0 (issued 2264 msec ago)
  pcieport 0000:00:03.0: pciehp: Timeout on hotplug command 0x05c0 (issued 2288 msec ago)

Clear ctrl->cmd_busy in pcie_poll_cmd() when it detects a Command Completed
event (PCI_EXP_SLTSTA_CC).
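
The polling path can be modelled in user space as below. This is a sketch under assumptions: struct fake_ctrl and poll_cmd() are hypothetical, the Slot Status register is replaced by a plain field, and there is no delay between polls.

```c
#include <stdbool.h>
#include <stdint.h>

#define SLTSTA_CC 0x0010u  /* Command Completed bit, as in PCI_EXP_SLTSTA_CC */

/* Hypothetical controller state; slot_status stands in for the register. */
struct fake_ctrl {
    uint16_t slot_status;
    bool cmd_busy;
};

/* Poll up to max_polls times; returns true if the command completed. */
static bool poll_cmd(struct fake_ctrl *ctrl, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (ctrl->slot_status & SLTSTA_CC) {
            /* Models the write-1-to-clear of the status bit. */
            ctrl->slot_status &= (uint16_t)~SLTSTA_CC;
            /* The step the fix adds: the command is no longer in flight. */
            ctrl->cmd_busy = false;
            return true;
        }
    }
    return false;
}
```

Without the cmd_busy reset, a later wait would still see a busy command and run into the timeout shown above.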

[bhelgaas: commit log]
Fixes: a5dd4b4b05 ("PCI: pciehp: Wait for hotplug command completion where necessary")
Link: https://lore.kernel.org/r/20211111054258.7309-1-zhangliguang@linux.alibaba.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215143
Link: https://lore.kernel.org/r/20211126173309.GA12255@wunner.de
Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
Cc: stable@vger.kernel.org	# v4.19+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: chuguangqing <chuguangqing@inspur.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3033
2024-04-10 03:10:56 +00:00
Lukas Wunner 34d0ea664c PCI: pciehp: Fix infinite loop in IRQ handler upon power fault
ANBZ: #8731

commit 23584c1ed3 upstream.

The Power Fault Detected bit in the Slot Status register differs from
all other hotplug events in that it is sticky:  It can only be cleared
after turning off slot power.  Per PCIe r5.0, sec. 6.7.1.8:

  If a power controller detects a main power fault on the hot-plug slot,
  it must automatically set its internal main power fault latch [...].
  The main power fault latch is cleared when software turns off power to
  the hot-plug slot.

The stickiness used to cause interrupt storms and infinite loops which
were fixed in 2009 by commits 5651c48cfa ("PCI pciehp: fix power fault
interrupt storm problem") and 99f0169c17 ("PCI: pciehp: enable
software notification on empty slots").

Unfortunately in 2020 the infinite loop issue was inadvertently
reintroduced by commit 8edf5332c3 ("PCI: pciehp: Fix MSI interrupt
race"):  The hardirq handler pciehp_isr() clears the PFD bit until
pciehp's power_fault_detected flag is set.  That happens in the IRQ
thread pciehp_ist(), which never learns of the event because the hardirq
handler is stuck in an infinite loop.  Fix by setting the
power_fault_detected flag already in the hardirq handler.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=214989
Link: https://lore.kernel.org/linux-pci/DM8PR11MB5702255A6A92F735D90A4446868B9@DM8PR11MB5702.namprd11.prod.outlook.com
Fixes: 8edf5332c3 ("PCI: pciehp: Fix MSI interrupt race")
Link: https://lore.kernel.org/r/66eaeef31d4997ceea357ad93259f290ededecfd.1637187226.git.lukas@wunner.de
Reported-by: Joseph Bao <joseph.bao@intel.com>
Tested-by: Joseph Bao <joseph.bao@intel.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org # v4.19+
Cc: Stuart Hayes <stuart.w.hayes@gmail.com>
[sudip: adjust context]
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: chuguangqing <chuguangqing@inspur.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3033
2024-04-10 03:10:56 +00:00
Stuart Hayes 2b963e6ed5 PCI: pciehp: Fix MSI interrupt race
ANBZ: #8731

[ Upstream commit 8edf5332c3 ]

Without this commit, a PCIe hotplug port can stop generating interrupts on
hotplug events, so device adds and removals will not be seen:

The pciehp interrupt handler pciehp_isr() reads the Slot Status register
and then writes back to it to clear the bits that caused the interrupt.  If
a different interrupt event bit gets set between the read and the write,
pciehp_isr() returns without having cleared all of the interrupt event
bits.  If this happens when the MSI isn't masked (which by default it isn't
in handle_edge_irq(), and which it will never be when MSI per-vector
masking is not supported), we won't get any more hotplug interrupts from
that device.

That is expected behavior, according to the PCIe Base Spec r5.0, section
6.7.3.4, "Software Notification of Hot-Plug Events".

Because the Presence Detect Changed and Data Link Layer State Changed event
bits can both get set at nearly the same time when a device is added or
removed, this is more likely to happen than it might seem.  The issue was
found (and can be reproduced rather easily) by connecting and disconnecting
an NVMe storage device on at least one system model where the NVMe devices
were being connected to an AMD PCIe port (PCI device 0x1022/0x1483).

Fix the issue by modifying pciehp_isr() to loop back and re-read the Slot
Status register immediately after writing to it, until it sees that all of
the event status bits have been cleared.
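
The re-read loop can be illustrated with a small simulation. Everything here is an assumption for the sake of the sketch: struct fake_slot, the helper names, and the late_event field that injects an event between the first read and the write.

```c
#include <stdint.h>

/* Hypothetical slot model; status stands in for the Slot Status event bits. */
struct fake_slot {
    uint16_t status;
    uint16_t late_event;  /* event that lands between a read and the write */
};

static uint16_t read_status(const struct fake_slot *s)
{
    return s->status;
}

/* Write-1-to-clear, then let the late event arrive, as hardware might. */
static void clear_status(struct fake_slot *s, uint16_t bits)
{
    s->status &= (uint16_t)~bits;
    s->status |= s->late_event;
    s->late_event = 0;
}

/* Returns every event bit seen; loops until the register reads back clean. */
static uint16_t collect_events(struct fake_slot *s)
{
    uint16_t events = 0, status;

    while ((status = read_status(s)) != 0) {
        events |= status;
        clear_status(s, status);
    }
    return events;
}
```

A single read-then-write pass would return with the late bit still set; the loop picks it up on the second iteration.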

[lukas: drop loop count limitation, write "events" instead of "status",
don't loop back in INTx and poll modes, tweak code comment & commit msg]
Link: https://lore.kernel.org/r/78b4ced5072bfe6e369d20e8b47c279b8c7af12e.1582121613.git.lukas@wunner.de
Tested-by: Stuart Hayes <stuart.w.hayes@gmail.com>
Signed-off-by: Stuart Hayes <stuart.w.hayes@gmail.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: chuguangqing <chuguangqing@inspur.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3033
2024-04-10 03:10:56 +00:00
Coly Li bece57dd9b bcache: avoid NULL checking to c->root in run_cache_set()
ANBZ: #8720

commit 3eba5e0b24 upstream.

In run_cache_set() after c->root returned from bch_btree_node_get(), it
is checked by IS_ERR_OR_NULL(). Indeed it is unnecessary to check NULL
because bch_btree_node_get() will not return NULL pointer to caller.

This patch replaces IS_ERR_OR_NULL() by IS_ERR() for the above reason.

Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20231120052503.6122-11-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3034
2024-04-10 02:34:36 +00:00
Coly Li bbcf50d69f bcache: add code comments for bch_btree_node_get() and __bch_btree_node_alloc()
ANBZ: #8720

commit 31f5b956a1 upstream.

This patch adds code comments to bch_btree_node_get() and
__bch_btree_node_alloc() that NULL pointer will not be returned and it
is unnecessary to check NULL pointer by the callers of these routines.

Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20231120052503.6122-10-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3034
2024-04-10 02:34:36 +00:00
Coly Li b8b4c2407f bcache: avoid oversize memory allocation by small stripe_size
ANBZ: #8720

commit baf8fb7e0e upstream.

Arrays bcache->stripe_sectors_dirty and bcache->full_dirty_stripes are
used for dirty data writeback; their sizes are decided by backing device
capacity and stripe size. A larger backing device capacity or a smaller
stripe size makes these two arrays occupy more dynamic memory space.

Currently bcache->stripe_size is directly inherited from
queue->limits.io_opt of the underlying storage device. For normal hard
drives, limits.io_opt is 0, and bcache sets the corresponding
stripe_size to 1TB (1<<31 sectors); this has worked fine for 10+ years. But for
devices that do declare a value for queue->limits.io_opt, a small stripe_size
(compared to 1TB) becomes an issue, causing oversize memory allocations for
bcache->stripe_sectors_dirty and bcache->full_dirty_stripes, as the
capacity of hard drives has grown much larger in the recent decade.

For example, for a raid5 array assembled from three 20TB hard drives, the raid
device capacity is 40TB with a typical 512KB limits.io_opt. After the
calculation in bcache code, these two arrays will occupy 400MB of dynamic
memory. Even worse, Andrea Tomassetti reports that a 4KB limits.io_opt is
declared on a new 2TB hard drive, so these two arrays request 2GB and
512MB of dynamic memory from kzalloc(). The result is that the bcache device
always fails to initialize on his system.

To avoid the oversize memory allocation, bcache->stripe_size should not
be directly inherited from queue->limits.io_opt of the underlying device.
This patch defines BCH_MIN_STRIPE_SZ (4MB) as the minimal bcache stripe size
and sets the bcache device's stripe size against the declared limits.io_opt
value from the underlying storage device,
- If the declared limits.io_opt > BCH_MIN_STRIPE_SZ, bcache device will
  set its stripe size directly by this limits.io_opt value.
- If the declared limits.io_opt < BCH_MIN_STRIPE_SZ, bcache device will
  set its stripe size to a multiple of limits.io_opt that is equal to or
  larger than BCH_MIN_STRIPE_SZ.
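
The two rules above can be sketched as a single rounding step. This is only an illustration in bytes: pick_stripe_size() and MIN_STRIPE_SZ are hypothetical names, and the io_opt == 0 case (no hint declared) is deliberately left to the caller.

```c
#include <stdint.h>

#define MIN_STRIPE_SZ (4u << 20)  /* 4MB, mirroring BCH_MIN_STRIPE_SZ */

/* Hypothetical helper: io_opt and the result are both in bytes. */
static uint64_t pick_stripe_size(uint64_t io_opt)
{
    if (io_opt == 0)
        return 0;  /* no hint declared; outside the scope of this sketch */
    if (io_opt >= MIN_STRIPE_SZ)
        return io_opt;
    /* Smallest multiple of io_opt that is >= MIN_STRIPE_SZ. */
    return ((MIN_STRIPE_SZ + io_opt - 1) / io_opt) * io_opt;
}
```

For the two devices mentioned in this message, a 512KB and a 4KB io_opt both round up to a 4MB stripe in this sketch.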

Then the minimal stripe size of a bcache device will always be >= 4MB.
For a 40TB raid5 device with 512KB limits.io_opt, memory occupied by
bcache->stripe_sectors_dirty and bcache->full_dirty_stripes will be 50MB
in total. For a 2TB hard drive with 4KB limits.io_opt, memory occupied
by these two arrays will be 2.5MB in total.

Such an amount of memory allocated for bcache->stripe_sectors_dirty and
bcache->full_dirty_stripes is reasonable for most storage devices.

Reported-by: Andrea Tomassetti <andrea.tomassetti-opensource@devo.com>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Eric Wheeler <bcache@lists.ewheeler.net>
Link: https://lore.kernel.org/r/20231120052503.6122-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3034
2024-04-10 02:34:36 +00:00
Markus Weippert d871f7434d bcache: revert replacing IS_ERR_OR_NULL with IS_ERR
ANBZ: #8720

commit bb6cc25386 upstream.

Commit 028ddcac47 ("bcache: Remove unnecessary NULL point check in
node allocations") replaced IS_ERR_OR_NULL by IS_ERR. This leads to a
NULL pointer dereference.

BUG: kernel NULL pointer dereference, address: 0000000000000080
Call Trace:
 ? __die_body.cold+0x1a/0x1f
 ? page_fault_oops+0xd2/0x2b0
 ? exc_page_fault+0x70/0x170
 ? asm_exc_page_fault+0x22/0x30
 ? btree_node_free+0xf/0x160 [bcache]
 ? up_write+0x32/0x60
 btree_gc_coalesce+0x2aa/0x890 [bcache]
 ? bch_extent_bad+0x70/0x170 [bcache]
 btree_gc_recurse+0x130/0x390 [bcache]
 ? btree_gc_mark_node+0x72/0x230 [bcache]
 bch_btree_gc+0x5da/0x600 [bcache]
 ? cpuusage_read+0x10/0x10
 ? bch_btree_gc+0x600/0x600 [bcache]
 bch_gc_thread+0x135/0x180 [bcache]

The relevant code starts with:

    new_nodes[0] = NULL;

    for (i = 0; i < nodes; i++) {
        if (__bch_keylist_realloc(&keylist, bkey_u64s(&r[i].b->key)))
            goto out_nocoalesce;
    // ...
out_nocoalesce:
    // ...
    for (i = 0; i < nodes; i++)
        if (!IS_ERR(new_nodes[i])) {  // IS_ERR_OR_NULL before 028ddcac47
            btree_node_free(new_nodes[i]);  // new_nodes[0] is NULL
            rw_unlock(true, new_nodes[i]);
        }

This patch replaces IS_ERR() by IS_ERR_OR_NULL() to fix this.
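
Why a NULL entry slips past IS_ERR() can be seen with user-space approximations of the helpers. The lowercase names below are stand-ins written for this sketch, not the macros from the kernel headers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* True only for pointers in the top MAX_ERRNO addresses (encoded errnos). */
static bool is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Additionally treats NULL as "nothing usable here". */
static bool is_err_or_null(const void *p)
{
    return p == NULL || is_err(p);
}

/* Encodes a negative errno value as a pointer. */
static void *err_ptr(long err)
{
    return (void *)(intptr_t)err;
}
```

NULL is address 0, far from the error range, so only the combined check keeps the cleanup loop from freeing new_nodes[0] when it was never allocated.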

Fixes: 028ddcac47 ("bcache: Remove unnecessary NULL point check in node allocations")
Link: https://lore.kernel.org/all/3DF4A87A-2AC1-4893-AE5F-E921478419A9@suse.de/
Cc: stable@vger.kernel.org
Cc: Zheng Wang <zyytlz.wz@163.com>
Cc: Coly Li <colyli@suse.de>
Signed-off-by: Markus Weippert <markus@gekmihesg.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3034
2024-04-10 02:34:36 +00:00
Rand Deeb 4a7d815e44 bcache: prevent potential division by zero error
ANBZ: #8720

commit 2c7f497ac2 upstream.

In SHOW(), the variable 'n' is of type 'size_t'. Although a conditional
check verifies that 'n' is not zero before the 'do_div' macro is
executed, a division-by-zero error is still possible in 64-bit
environments.

The concern arises when 'n' is 64 bits in size, greater than zero, and
the lower 32 bits of it are zeros. In such cases, the conditional check
passes because 'n' is non-zero, but the 'do_div' macro casts 'n' to
'uint32_t,' effectively truncating it to its lower 32 bits.
Consequently, the 'n' value becomes zero.

To fix this potential division by zero error and ensure precise
division handling, this commit replaces the 'do_div' macro with
div64_u64(). div64_u64() is designed to work with 64-bit operands,
guaranteeing that division is performed correctly.

This change enhances the robustness of the code, ensuring that division
operations yield accurate results in all scenarios, eliminating the
possibility of division by zero, and improving compatibility across
different 64-bit environments.
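
The truncation described above can be reproduced in plain userspace C. This is a minimal sketch under stated assumptions: truncated_divisor() and div_u64_checked() are hypothetical helpers written for illustration, and they only model the behaviour of a 32-bit-divisor macro versus a full 64-bit division; they are not the kernel's do_div() or div64_u64().

```c
#include <stdbool.h>
#include <stdint.h>

/* Models what a helper taking a 32-bit divisor sees when handed a
 * 64-bit value: only the lower 32 bits survive the cast. */
static uint32_t truncated_divisor(uint64_t n)
{
    return (uint32_t)n;
}

/* Full 64-bit division with an explicit zero check on the complete
 * divisor. Returns false instead of dividing when n is zero. */
static bool div_u64_checked(uint64_t dividend, uint64_t n, uint64_t *quot)
{
    if (n == 0)
        return false;
    *quot = dividend / n;
    return true;
}
```

For n = 1ULL << 32, the check "n != 0" passes, yet truncated_divisor(n) is 0, which is the case the commit guards against; dividing by the full 64-bit value avoids it.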

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Signed-off-by: Rand Deeb <rand.sec96@gmail.com>
Cc:  <stable@vger.kernel.org>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20231120052503.6122-5-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3034
2024-04-10 02:34:36 +00:00
Coly Li 6cdca0c05a bcache: check return value from btree_node_alloc_replacement()
ANBZ: #8720

commit 777967e7e9 upstream.

In btree_gc_rewrite_node(), pointer 'n' is not checked after it returns
from btree_node_alloc_replacement(). 'n' may be a non-NULL ERR_PTR(),
and dereferencing such an error pointer is not permitted in the
following code. Therefore the return value must be checked after 'n' is
returned from btree_node_alloc_replacement().

Signed-off-by: Coly Li <colyli@suse.de>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Cc:  <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20231120052503.6122-3-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3034
2024-04-10 02:34:36 +00:00
Coly Li 82eb5f3436 bcache: replace a mistaken IS_ERR() by IS_ERR_OR_NULL() in btree_gc_coalesce()
ANBZ: #8720

commit f72f4312d4 upstream.

Commit 028ddcac47 ("bcache: Remove unnecessary NULL point check in
node allocations") made the following change inside btree_gc_coalesce(),

31 @@ -1340,7 +1340,7 @@ static int btree_gc_coalesce(
32         memset(new_nodes, 0, sizeof(new_nodes));
33         closure_init_stack(&cl);
34
35 -       while (nodes < GC_MERGE_NODES && !IS_ERR_OR_NULL(r[nodes].b))
36 +       while (nodes < GC_MERGE_NODES && !IS_ERR(r[nodes].b))
37                 keys += r[nodes++].keys;
38
39         blocks = btree_default_blocks(b->c) * 2 / 3;

At line 35 the original r[nodes].b is not always allocated from
__bch_btree_node_alloc(), and is possibly initialized as a NULL pointer
by the caller of btree_gc_coalesce(). Therefore the change at line 36 is
not correct.

This patch replaces the mistaken IS_ERR() by IS_ERR_OR_NULL() to avoid
potential issues.

Fixes: 028ddcac47 ("bcache: Remove unnecessary NULL point check in node allocations")
Cc:  <stable@vger.kernel.org> # 6.5+
Cc: Zheng Wang <zyytlz.wz@163.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20231120052503.6122-9-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3034
2024-04-10 02:34:36 +00:00
Zheng Wang ba81754866 bcache: Fix __bch_btree_node_alloc to make the failure behavior consistent
ANBZ: #8720

commit 80fca8a10b upstream.

In some specific situations, the return value of __bch_btree_node_alloc
may be NULL. This may lead to a NULL pointer dereference in a caller,
for example through the call chain:
btree_split->bch_btree_node_alloc->__bch_btree_node_alloc.

Fix it by initializing the return value in __bch_btree_node_alloc.

Fixes: cafe563591 ("bcache: A block layer cache")
Cc: stable@vger.kernel.org
Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-6-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3034
2024-04-10 02:34:36 +00:00
Coly Li df5923bd45 bcache: remove 'int n' from parameter list of bch_bucket_alloc_set()
ANBZ: #8720

commit 17e4aed830 upstream.

The parameter 'int n' of bch_bucket_alloc_set() is not clearly
defined. According to the code comments, n is the number of buckets to
allocate, but in the code itself 'n' is the maximum number of caches to
iterate. In fact, at all the locations where bch_bucket_alloc_set() is
called, 'n' is always 1.

This patch removes the confusing and unnecessary 'int n' from the
parameter list of bch_bucket_alloc_set(), and explicitly allocates only
1 bucket for its caller.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stable-dep-of: 80fca8a10b ("bcache: Fix __bch_btree_node_alloc to make the failure behavior consistent")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3034
2024-04-10 02:34:36 +00:00
Shenghui Wang d38337a3fb bcache: use MAX_CACHES_PER_SET instead of magic number 8 in __bch_bucket_alloc_set
ANBZ: #8720

commit 8792099f9a upstream.

The current cache_set has at most MAX_CACHES_PER_SET caches, and the
macro is used for
"
	struct cache *cache_by_alloc[MAX_CACHES_PER_SET];
"
in the definition of struct cache_set.

Use MAX_CACHES_PER_SET instead of magic number 8 in
__bch_bucket_alloc_set.

Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stable-dep-of: 80fca8a10b ("bcache: Fix __bch_btree_node_alloc to make the failure behavior consistent")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3034
2024-04-10 02:34:36 +00:00