Support a unified read/write configuration, so the bps/iops limits
of a cgroup can be configured more easily.
Add the readwrite_dynamic_ratio interface, which estimates the
read/write ratio from historical data and uses it to control the
read/write block throttle dynamically.
The ratio is estimated from the bytes/iops dispatched in the last
slice. Since read and write slices are not aligned and may be
trimmed or extended, the number of elapsed slices is used to derive
an approximate rate.
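A sketch of the intended configuration (only readwrite_dynamic_ratio
is named by this patch; the blkio.throttle.* file names and values
below are assumptions):
  # unified read+write bps limit for device 253:0 (assumed file name)
  echo "253:0 104857600" > /sys/fs/cgroup/blkio/app/blkio.throttle.readwrite_bps_device
  # let the throttler split the budget using the ratio observed in
  # previous slices (assumed to be a 0/1 mode switch)
  echo 1 > /sys/fs/cgroup/blkio/app/blkio.throttle.readwrite_dynamic_ratio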
Tencent-internal-TAPDID: 878345747
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: Alex Shi <alexsshi@tencent.com>
Reviewed-by: Hongbo Li <herberthbli@tencent.com>
Add entries for iocost and iolatency to cgroup v1.
The effective weight used by iocost may sometimes differ from the
weight that users configured. This patch displays this information
in each cgroup's blk.cost.stat.
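For example (the mount point and the exact fields shown are
assumptions):
  cat /sys/fs/cgroup/blkio/app/blk.cost.stat
  # prints, per device, the configured weight and the effective
  # weight that iocost is currently applying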
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Signed-off-by: Lenny Chen <lennychen@tencent.com>
Add a sysctl switch to control buffer IO accounting in the memcg
of cgroup v1. When this switch is on, removing a memory cgroup may
leave zombie slabs behind until writeback finishes.
Both io_qos and io_cgv1_buff_wb need to be turned on for cgroup v1.
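For example (kernel.io_qos is named later in this series; the
sysctl namespace of io_cgv1_buff_wb is an assumption):
  sysctl -w kernel.io_qos=1
  sysctl -w kernel.io_cgv1_buff_wb=1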
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
Introduce the per-cgroup cgroup.sync interface, so that we can
ensure the dirty pages of a cgroup are actually written to disk,
without having to consider dirty pages generated elsewhere. This
avoids the long cgroup exit delays caused by system-level sync,
as well as the resulting IO jitter.
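A sketch of the intended use (the value written and the v1 mount
path are assumptions):
  # write back this cgroup's dirty pages only, e.g. before removal
  echo 1 > /sys/fs/cgroup/memory/app/cgroup.sync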
Note:
struct wb_writeback_work moved from fs/fs-writeback.c to
include/linux/writeback.h
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Add buffer IO isolation (bind_blkio) to v1 based on the v2
infrastructure, so the interface for direct IO and buffered IO is
unified.
Add a sysctl switch to allow migrating an already-bound cgroup.
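A sketch of the intended use, assuming bind_blkio accepts a blkio
cgroup path (the file location and value format are assumptions):
  # bind this memory cgroup's buffered writeback to a blkio cgroup
  echo /sys/fs/cgroup/blkio/app > /sys/fs/cgroup/memory/app/bind_blkio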
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Signed-off-by: Lenny Chen <lennychen@tencent.com>
Add buffer IO throttling for cgroup v1 based on dirty-page
throttling. Since the actual IO speed is not taken into account,
this solution may cause a continuous accumulation of dirty pages
when IO performance is the bottleneck, which degrades the
isolation.
Note:
struct blkcg moved from block/blk-cgroup.h to
include/linux/blk-cgroup.h
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Signed-off-by: Lenny Chen <lennychen@tencent.com>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Use sysctl_io_qos as the switch for the RUE IO functions.
Also support the blk throttle hierarchy, enabled by default.
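For example:
  # runtime switch for the RUE IO functions
  sysctl -w kernel.io_qos=1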
Note:
the throttle hierarchy is not affected by kernel.io_qos, since it
is linked during the initialization phase
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Signed-off-by: Lenny Chen <lennychen@tencent.com>
Upstream: no
For non-SMP builds, also allocate the dkstats dynamically;
however, the wrong struct type was assigned.
Fixes: 6dfa517032 ("blkcg/diskstats: add per blkcg diskstats support")
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Blkcg: add recursive diskstats.
Fix the issue where only the last partition was printed in the
original solution, and remove the list.
Note:
This function exists only for backward compatibility with
tkernel4, since commit f733164829 ("blk-cgroup: reimplement basic
IO stats using cgroup rstat") implemented blkg_iostat_set for
cgroup stats in blkcg_gq.
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Signed-off-by: Lenny Chen <lennychen@tencent.com>
Make the block cgroup I/O completion and done functions dynamic
to account per-cgroup I/O status in eBPF.
Fix blkcg_dkstats.alloc_node being undefined: alloc_node is only
available when CONFIG_SMP is enabled, so move the INIT to the
right place.
Export blkcg symbols to be used in bpf accounting.
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: Alex Shi <alexsshi@tencent.com>
Add sysctl_vm_use_priority_oom as a global setting to enable the
priority_oom setting for all cgroups without the need to manually
set it for each cgroup. This global setting has no effect when it
is turned off.
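A sketch of the intended use (the exact sysctl path behind
sysctl_vm_use_priority_oom is an assumption):
  # enable priority_oom for all cgroups at once
  sysctl -w vm.use_priority_oom=1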
Signed-off-by: Haojie Ning <paulning@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
The system-wide and per-cgroup pagecache limits can cause
processes to get stuck when MGLRU is enabled.
Use lru_gen_enabled() to check whether MGLRU is enabled in the
system.
Signed-off-by: Honglin Li <honglinli@tencent.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com>
When a memcg is removed, page caches and slab pages may still
reference it, which can leave a very large number of dying memcgs
in our system. This feature can asynchronously clean up the dying
memcgs in the system.
1) sysctl -w vm.clean_dying_memcg_async=1
   #start a kthread to asynchronously clean dying memcgs;
   #the default value is 0.
2) sysctl -w vm.clean_dying_memcg_threshold=10
   #whenever 10 dying memcgs have accumulated in the system,
   #wake up the kthread to clean them asynchronously;
   #the default value is 100.
Signed-off-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
A new memory.page_cache_hit control file is added under each
memory cgroup directory. Reading this file prints the page cache
hit and miss ratios at the memory cgroup level.
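For example (the mount point and output format are assumptions):
  cat /sys/fs/cgroup/memory/app/memory.page_cache_hit
  # prints this cgroup's page cache hit and miss ratios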
Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
A new memory.latency_histogram control file is added under each
memory cgroup directory. Reading this file prints the memory
access latency histogram at the memory cgroup level.
Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
Introduce a background page reclaim mechanism for memcg; it can
be configured with different reclaim strategies according to the
cgroup priorities.
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
Signed-off-by: Mengmeng Chen <bauerchen@tencent.com>
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
Under memory pressure, reclaim and OOM kills happen. With
multiple cgroups in one system, we may want the memory or tasks
of some of them to survive reclaim and OOM while other candidates
exist. When an OOM occurs, the victim is always chosen from a
low-priority memcg. This works for both memcg OOM and global OOM;
it can be enabled/disabled through @memory.use_priority_oom (for
global OOM, through the root memcg's @memory.use_priority_oom)
and is disabled by default.
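For example (the v1 mount path is an assumption):
  # priority-based victim selection for global OOM (root memcg)
  echo 1 > /sys/fs/cgroup/memory/memory.use_priority_oom
  # the same for a single memcg's OOM
  echo 1 > /sys/fs/cgroup/memory/app/memory.use_priority_oom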
Signed-off-by: Haiwei Li <gerryhwli@tencent.com>
Signed-off-by: Mengmeng Chen <bauerchen@tencent.com>
Signed-off-by: Xiaoguang Chen <xiaoggchen@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
The global pagecache limit function was broken by a backport of
an upstream commit. In scenarios where the active file list needs
to be reclaimed, the LRU_ACTIVE_FILE list cannot be reclaimed,
making the pagecache limit inaccurate.
When shrinking the page cache, we set an initial value of
DEACTIVATE_FILE for may_deactivate in scan_control, allowing the
active file list to be scanned in shrink_list().
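A minimal sketch of the idea in the pagecache-shrink path (the
scan_control fields and DEACTIVATE_FILE exist in mm/vmscan.c; the
surrounding code is elided):
  struct scan_control sc = {
          .gfp_mask       = GFP_KERNEL,
          .may_writepage  = 1,
          .may_unmap      = 1,
          /* let shrink_list() deactivate and scan active file LRU */
          .may_deactivate = DEACTIVATE_FILE,
  };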
Signed-off-by: Honglin Li <honglinli@tencent.com>
Reviewed-by: Hongbo Li <herberthbli@tencent.com>
cls_tc_rx_hook() assigns the net_device pointer of the receiving
network interface to sock->in_dev. Using the sock->in_dev pointer
can lead to a wrong memory access if the struct net_device is
freed after the network interface is unregistered, which may
crash the kernel.
The above use-after-free issue causes a crash as follows:
BUG: unable to handle page fault for address: ffffffed698999c8
CPU: 50 PID: 1290732 Comm: kubelet Kdump: loaded
Tainted: G O K 5.4.119-1-tlinux4-0009.1 #1
RIP: 0010:cls_cgroup_tx_accept+0x5e/0x120
Call Trace:
<IRQ>
cls_tc_tx_hook+0x10d/0x1a0
nf_hook_slow+0x43/0xc0
__ip_local_out+0xcb/0x130
? ip_forward_options+0x190/0x190
ip_local_out+0x1c/0x40
__ip_queue_xmit+0x162/0x3d0
? rx_cgroup_throttle.isra.4+0x2b0/0x2b0
ip_queue_xmit+0x10/0x20
__tcp_transmit_skb+0x57f/0xbe0
__tcp_retransmit_skb+0x1b0/0x8a0
tcp_retransmit_skb+0x19/0xd0
tcp_retransmit_timer+0x367/0xa80
? kvm_clock_get_cycles+0x11/0x20
? ktime_get+0x34/0x90
tcp_write_timer_handler+0x93/0x1f0
tcp_write_timer+0x7c/0x80
? tcp_write_timer_handler+0x1f0/0x1f0
call_timer_fn+0x35/0x130
run_timer_softirq+0x1a8/0x420
? ktime_get+0x34/0x90
? clockevents_program_event+0x85/0xe0
__do_softirq+0x8c/0x2d7
? hrtimer_interrupt+0x12a/0x210
irq_exit+0xa3/0xb0
smp_apic_timer_interrupt+0x77/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
We introduce indev_ifindex as a new struct field to record the
ifindex of the net_device; the index can then be used to look up
the device, avoiding direct access to the struct members behind
the stale in_dev pointer.
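A sketch of the approach (the field placement on struct sock is a
simplification; dev_get_by_index_rcu() is the standard RCU lookup):
  /* rx hook: record only the ifindex, never cache the pointer */
  sk->indev_ifindex = skb->dev->ifindex;

  /* later: resolve it under RCU; the device may already be
   * unregistered, in which case the lookup simply fails */
  rcu_read_lock();
  dev = dev_get_by_index_rcu(sock_net(sk), sk->indev_ifindex);
  if (dev) {
          /* ... use dev only inside the RCU read section ... */
  }
  rcu_read_unlock();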
Fixes: f8829546f3b3 ("rue/net: init netcls traffic controller")
Signed-off-by: Honglin Li <honglinli@tencent.com>
Reviewed-by: Ze Gao <zegao@tencent.com>
The memory of struct cgroup_cls_state may be freed while a
pointer to it is still in use. This can lead to a wrong memory
access and thus kernel crashes.
Increase the reference count of struct cgroup_cls_state
through css_tryget_online while the struct is in use.
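A sketch of the pattern (the surrounding code is elided):
  /* pin the css so the cgroup cannot be freed while in use */
  if (!css_tryget_online(&cs->css))
          return;         /* cgroup is being destroyed; bail out */
  /* ... safely use cs ... */
  css_put(&cs->css);      /* drop the reference when done */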
The use-after-free described above causes a crash as follows:
CPU: 56 PID: 4161866 Comm: AppSourceDatapr Kdump: loaded
Tainted: G O 5.4.119-1-tlinux4-0008 #1
RIP: 0010:cls_cgroup_adjust_wnd+0x58/0x180
Call Trace:
<IRQ>
__tcp_transmit_skb+0x6a8/0xbe0
__tcp_send_ack.part.50+0xc2/0x170
tcp_send_ack+0x1c/0x20
tcp_send_dupack+0x29/0x130
? kvm_clock_get_cycles+0x11/0x20
tcp_validate_incoming+0x332/0x440
tcp_rcv_established+0x1f6/0x670
tcp_v4_do_rcv+0x18a/0x220
tcp_v4_rcv+0xbfd/0xca0
ip_protocol_deliver_rcu+0x1f/0x180
ip_local_deliver_finish+0x51/0x60
ip_local_deliver+0xcd/0xe0
? ip_protocol_deliver_rcu+0x180/0x180
ip_rcv_finish+0x7b/0x90
ip_rcv+0xb5/0xc0
? ip_rcv_finish_core.isra.18+0x380/0x380
__netif_receive_skb_one_core+0x59/0x80
__netif_receive_skb+0x26/0x70
process_backlog+0xac/0x150
net_rx_action+0x127/0x380
? ktime_get+0x34/0x90
__do_softirq+0x8c/0x2d7
irq_exit+0xa3/0xb0
smp_call_function_single_interrupt+0x4c/0xd0
call_function_single_interrupt+0xf/0x20
</IRQ>
Signed-off-by: Honglin Li <honglinli@tencent.com>
Reviewed-by: Ze Gao <zegao@tencent.com>
Reviewed-by: Haisu Wang <haisuwang@tencent.com>
Register and unregister the RUE net ops through the RUE modular
framework.
Signed-off-by: Honglin Li <honglinli@tencent.com>
Reviewed-by: Haisu Wang <haisuwang@tencent.com>
Introduce netcls controller interface files, which can be
configured to enable/disable the bandwidth allocation mechanism
among online net cgroups.
The mechanism migrates idle bandwidth between online cgroups
while guaranteeing a minimum bandwidth per cgroup, improving
resource utilization.
Signed-off-by: Honglin Li <honglinli@tencent.com>
Reviewed-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: Jason Xing <kernelxing@tencent.com>
Introduce the total bandwidth limit mechanism for both the rx and
tx directions.
Signed-off-by: Honglin Li <honglinli@tencent.com>
Signed-off-by: Zhiping Du <zhipingdu@tencent.com>
Introduce the per-cgroup bandwidth rate limit mechanism.
Signed-off-by: Honglin Li <honglinli@tencent.com>
Signed-off-by: Zhiping Du <zhipingdu@tencent.com>
Export the two functions again for modules like RUE.
This reverts commit 0bd476e6c6.
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
Add framework support so that RUE can be installed as a separate
module.
In order to insmod/rmmod safely, we use a per-cpu counter to
track how many RUE-related functions are in flight; it is only
safe to insmod/rmmod when no task is using any of the functions
registered by the RUE module.
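A minimal sketch of the scheme (the names rue_inflight, rue_enter,
rue_exit and rue_idle are hypothetical):
  static DEFINE_PER_CPU(long, rue_inflight);

  /* wrap every entry into a function registered by the module */
  static inline void rue_enter(void) { this_cpu_inc(rue_inflight); }
  static inline void rue_exit(void)  { this_cpu_dec(rue_inflight); }

  /* rmmod is allowed only once the summed counter drops to zero */
  static bool rue_idle(void)
  {
          long sum = 0;
          int cpu;

          for_each_possible_cpu(cpu)
                  sum += per_cpu(rue_inflight, cpu);
          return sum == 0;
  }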
Signed-off-by: Ze Gao <zegao@tencent.com>
Add the init code of the RUE module.
Support both the built-in and module (default) build modes.
Signed-off-by: Hongbo Li <herberthbli@tencent.com>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: Honglin Li <honglinli@tencent.com>
Add cgroup priority.
Signed-off-by: Hongbo Li <herberthbli@tencent.com>
Signed-off-by: Lei Chen <lennychen@tencent.com>
Signed-off-by: Yu Liu <allanyuliu@tencent.com>
Upstream: no
Commit 6dfa517032 unified blkcg_part_stat_add() so that the cpu
number is no longer passed implicitly. Since
CONFIG_BLK_CGROUP_DISKSTATS depends on CONFIG_BLK_CGROUP, there
is no need to define blkcg_part_stat_add() when CONFIG_BLK_CGROUP
is disabled.
Also correct the error message "implicit declaration of function
'blkcg_dkstats_show_comm'" seen when CONFIG_BLK_CGROUP_DISKSTATS
is disabled.
Fixes: 6dfa517032 ("blkcg/diskstats: add per blkcg diskstats support")
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Upstream: no
The watermark boost factor controls the level of reclaim when
memory is being fragmented. The intent is that compaction has
less work to do in the future and that the success rate of future
high-order allocations, such as SLUB allocations, THP and
hugetlbfs pages, increases.
However, it wakes up kswapd to do defragmentation, which caused
performance jitter in many cases without enough gain.
Some distributions, such as Debian, also set the default boost
factor to 0 to disable the feature.
WXG story of compaction causing performance jitter:
https://doc.weixin.qq.com/doc/w3_AIAAcwacAAYudo6ERcUQMiNUbmvzb?scode=AJEAIQdfAAoeO7AbqSAYQATQaYAJg
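The equivalent runtime setting (vm.watermark_boost_factor is an
upstream sysctl):
  # 0 disables watermark boosting entirely
  sysctl -w vm.watermark_boost_factor=0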
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
Reviewed-by: Jianping Liu <frankjpliu@tencent.com>
Enable DPC and EDR, which turn on the PCIe eDPC function. When
an uncorrectable error (UCE) arrives, eDPC can reset the device
link and resume the device, similar to hot-plugging it.
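The corresponding Kconfig options (both exist upstream):
  CONFIG_PCIE_DPC=y
  CONFIG_PCIE_EDR=y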
Signed-off-by: Aurelianliu <aurelianliu@tencent.com>
Fix CVE: CVE-2024-43868
[ Upstream commit fb197c5d2fd24b9af3d4697d0cf778645846d6d5 ]
When alignment handling is delegated to the kernel, everything must be
word-aligned in purgatory, since the trap handler is then set to the
kexec one. Without the alignment, hitting the exception would
ultimately crash. On other occasions, the kernel's handler would take
care of exceptions.
This has been tested on a JH7110 SoC with oreboot and its SBI delegating
unaligned access exceptions and the kernel configured to handle them.
Fixes: 736e30af58 ("RISC-V: Add purgatory")
Signed-off-by: Daniel Maslowski <cyrevolt@gmail.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240719170437.247457-1-cyrevolt@gmail.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
To avoid log output like the following:
[ 0.095948] *************************************************************
[ 0.095948] ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
[ 0.096220] ** **
[ 0.096221] ** IOMMU DebugFS SUPPORT HAS BEEN ENABLED IN THIS KERNEL **
[ 0.096222] ** **
[ 0.096223] ** This means that this kernel is built to expose internal **
[ 0.096224] ** IOMMU data structures, which may compromise security on **
[ 0.096225] ** your system. **
[ 0.096227] ** **
[ 0.096227] ** If you see this message and you are not debugging the **
[ 0.096228] ** kernel, report this immediately to your vendor! **
[ 0.096229] ** **
[ 0.096230] ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
[ 0.096231] *************************************************************
disable CONFIG_IOMMU_DEBUGFS.
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
When CONFIG_KASAN is enabled, the kernel runs much slower; set
the hung_task and soft lockup thresholds to 600 seconds.
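The hung task threshold can also be adjusted at runtime (the soft
lockup threshold is assumed to be set at build time here):
  sysctl -w kernel.hung_task_timeout_secs=600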
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>