Commit Graph

1228257 Commits

Author SHA1 Message Date
Haisu Wang 701785a3f8 rue/io: skip throttle REQ_META/REQ_PRIO IO
Do not throttle REQ_META/REQ_PRIO and kswapd IO
when skip_throttle_prio_req is enabled.

Signed-off-by: Haisu Wang <haisuwang@tencent.com>
2024-09-28 15:42:22 +08:00
Haisu Wang f630af7168 rue/io: buffered_write_bps hierarchy support
Support hierarchical setting of buffered_write_bps.

Signed-off-by: Haisu Wang <haisuwang@tencent.com>
2024-09-28 15:42:22 +08:00
Haisu Wang 701147d7b1 rue/io: support readwrite unified configuration
Support a unified read/write configuration, so we can more
easily configure the bps/iops of a cgroup.

Add the readwrite_dynamic_ratio interface. Support anticipating
the read/write ratio from previous data to control read/write
block throttling dynamically.

Anticipate the read/write ratio based on dispatched bytes/iops in
the last slice. Since read and write slices are not aligned and
may be trimmed or extended, use the number of elapsed slices to
get an approximate rate.

Tencent-internal-TAPDID: 878345747
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: Alex Shi <alexsshi@tencent.com>
Reviewed-by: Hongbo Li <herberthbli@tencent.com>
2024-09-28 15:42:22 +08:00
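The approximation described above (read/write ratio from last-slice dispatched bytes, averaged over elapsed slices) can be sketched roughly as follows. This is an illustration only; all variable names and numbers are made up, not the kernel's interface:

```shell
# Illustrative sketch of the dynamic read/write ratio idea:
# estimate the read share from bytes dispatched in the last
# slice, and an approximate per-slice rate over elapsed slices.
read_bytes=600      # bytes dispatched as reads in the last slice
write_bytes=200     # bytes dispatched as writes in the last slice
elapsed_slices=4    # slices elapsed since the last trim/extend

read_ratio=$(( read_bytes * 100 / (read_bytes + write_bytes) ))
approx_rate=$(( (read_bytes + write_bytes) / elapsed_slices ))
echo "read_ratio=${read_ratio}% approx_rate=${approx_rate}B/slice"
```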
Haisu Wang 8b986f7bdc rue/io: Add iocost and iolatency entry for cgroup v1
Add entry of iocost and iolatency for cgroup v1

The effective weight of iocost may sometimes differ from the weight
that users configured. This patch displays useful information in each
cgroup's blk.cost.stat.

Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Signed-off-by: Lenny Chen <lennychen@tencent.com>
2024-09-28 15:42:22 +08:00
Haisu Wang fed4a7c8be rue/io: add io_cgv1_buff_wb to enable buffer IO counting in cgroup v1
Add a sysctl switch to control buffer IO accounting in
the memcg of cgroup v1. If this switch is turned on, removing
a memory cgroup may leave zombie slabs until writeback finishes.

Both io_qos and io_cgv1_buff_wb need to be turned on in cgroup v1.

Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: Bin Lai <robinlai@tencent.com>
2024-09-28 15:42:22 +08:00
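A usage sketch for the two switches named above; the sysctl names come from the commit message, but the `kernel.` namespace is an assumption and may differ on your build:

```shell
# Hypothetical sysctl paths (the "kernel." prefix is assumed).
sysctl -w kernel.io_qos=1           # enable the RUE IO QoS switch
sysctl -w kernel.io_cgv1_buff_wb=1  # enable buffer IO accounting in cgroup v1
```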
Haisu Wang 826a0366a1 rue/io: introduce per mem_cgroup sync interface
Introduce the per-memcg cgroup.sync interface, so that we can
ensure that the dirty pages of a cgroup are actually written to
disk without considering dirty pages generated elsewhere.
This avoids the large cgroup exit delays caused by system-level
sync, as well as the resulting IO jitter.

Note:
struct wb_writeback_work moved from fs/fs-writeback.c to
include/linux/writeback.h

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
2024-09-28 15:42:22 +08:00
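The per-cgroup sync interface described above might be exercised like this; the mount point, cgroup name, and exact file name are assumptions based on the commit message, not verified paths:

```shell
# Hypothetical: flush only this cgroup's dirty pages before removing
# it, instead of a system-wide sync (cgroup v1 path assumed).
echo 1 > /sys/fs/cgroup/memory/mycg/memory.sync
rmdir /sys/fs/cgroup/memory/mycg
```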
Haisu Wang a12bb1a43d rue/io: add bufio isolation for cgroup v1
Add buffer IO isolation (bind_blkio) to v1 based on the v2
infrastructure, so we can unify the interface for dio and bufio.

Add a sysctl switch to allow migrating an already-bound cgroup.

Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Signed-off-by: Lenny Chen <lennychen@tencent.com>
2024-09-28 15:42:22 +08:00
Haisu Wang 1b1b938068 rue/io: Add bps information to blkio.throttle.stat
Bps information is missing in blkio.throttle.stat

Signed-off-by: Haisu Wang <haisuwang@tencent.com>
2024-09-28 15:42:22 +08:00
Haisu Wang a65dd3dd13 rue/io: Add blkio.throttle.stat
Add blkio.throttle.stat to show throttle stat

Signed-off-by: Haisu Wang <haisuwang@tencent.com>
2024-09-28 15:42:22 +08:00
Haisu Wang 1860b51781 rue/io: add buffer IO writeback throtl for cgroup v1
Add buffer IO throttling for cgroup v1 based on dirty throttling.
Since the actual IO speed is not considered, this solution
may cause continuous accumulation of dirty pages when IO
performance is the bottleneck, which degrades the isolation
effect.

Note:
struct blkcg moved from block/blk-cgroup.h to
include/linux/blk-cgroup.h

Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Signed-off-by: Lenny Chen <lennychen@tencent.com>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
2024-09-28 15:41:58 +08:00
Haisu Wang 286c5f95c6 rue/io: add io_qos switch and throtl hierarchy
Use sysctl_io_qos as the RUE IO feature switch.
Also support blk throttle hierarchy and enable it by default.

Note:
the throttle hierarchy is not affected by kernel.io_qos since it
is linked during the initialization phase

Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Signed-off-by: Lenny Chen <lennychen@tencent.com>
2024-09-28 15:41:58 +08:00
Haisu Wang 2497ec22c1 rue/io: Enable CONFIG_BLK_DEV_THROTTLING_CGROUP_V1 configuration
Make CONFIG_BLK_DEV_THROTTLING_CGROUP_V1 enabled by default.

Signed-off-by: Haisu Wang <haisuwang@tencent.com>
2024-09-27 11:31:27 +08:00
Haisu Wang fd71891bb8 rue/io: Correct the alloc type to disk_stats
Upstream: no

For non-SMP builds, also allocate dkstats dynamically; however,
the wrong struct type was assigned.

Fixes: 6dfa517032 ("blkcg/diskstats: add per blkcg diskstats support")
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
2024-09-27 11:31:26 +08:00
Haisu Wang 495e0e311e rue/io: add support for recursive diskstats
Add recursive diskstats to blkcg.
Fix the issue where only the last partition was printed in the
original solution, and remove the list.

Note:
This function is just for backward compatibility with tkernel4,
since commit f733164829 ("blk-cgroup: reimplement basic IO stats
using cgroup rstat") implements blkg_iostat_set for cgroup stats
in blkcg_gq.

Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Signed-off-by: Lenny Chen <lennychen@tencent.com>
2024-09-27 11:31:26 +08:00
Haisu Wang f35df3f918 rue/io: blkcg export blkcg symbols to be used in bpf accounting
Make the block cgroup I/O completion and done functions dynamic
to account per-cgroup I/O status in eBPF.

Fix blkcg_dkstats.alloc_node being undefined: alloc_node is only
available when CONFIG_SMP is enabled, so move the INIT to the
right place.
Export blkcg symbols to be used in bpf accounting.

Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: Alex Shi <alexsshi@tencent.com>
2024-09-27 11:25:19 +08:00
Haojie Ning a1574c433d rue/mm: add sysctl_vm_use_priority_oom to enable priority oom for all cgroups
Add sysctl_vm_use_priority_oom as a global setting to enable the
priority_oom setting for all cgroups without the need to manually
set it for each cgroup. This global setting has no effect when it
is turned off.

Signed-off-by: Haojie Ning <paulning@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
2024-09-27 11:13:32 +08:00
Honglin Li 7c45f9b01f rue/mm: compatible with mglru for pagecache limit
The pagecache limit for the system and per-cgroup causes
processes to get stuck when mglru is enabled.
Use lru_gen_enabled() to check whether mglru is
enabled in the system.

Signed-off-by: Honglin Li <honglinli@tencent.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com>
2024-09-27 11:13:32 +08:00
Xin Hao 4e6f350b03 rue/mm: fix file page_counter 'memcg->pagecache' error when THP enabled
When the CONFIG_MEM_QPS feature is enabled, the __mod_lruvec_state
function is called to increase the per-memcg page_counter 'pagecache'
value for 'NR_FILE_PAGES'. This is not a problem if THP is disabled,
but if THP is enabled, the CONFIG_MEM_QPS feature forgot to increase
the page_counter 'pagecache' value, because THP pagecache is accounted
as 'NR_FILE_THPS'. This leads the page_counter 'pagecache' value to
become negative when these THP pagecache pages are released, resulting
in the following warning:

[55530.397796] ------------[ cut here ]------------
[55530.398854] page_counter underflow: -512 nr_pages=512
[55530.399864] WARNING: CPU: 1 PID: 3026157 at mm/page_counter.c:63 page_counter_cancel+0x55/0x60
[55530.412193] CPU: 1 PID: 3026157 Comm: bash Kdump: loaded Tainted: G
[55530.416075] RIP: 0010:page_counter_cancel+0x55/0x60
[55530.421353] RAX: 0000000000000000 RBX: ffff8888161a8270 RCX: 0000000000000006
[55530.422680] RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff88881f85bb60
[55530.424008] RBP: ffffc90004ceba58 R08: 0000000000009617 R09: ffff88881584c820
[55530.425330] R10: 0000000000000000 R11: ffffffffa00d60b0 R12: 0000000000000200
[55530.426663] R13: ffff8888194f7000 R14: 0000000000000000 R15: 0000000000000000
[55530.427999] FS:  00007fe2932d1740(0000) GS:ffff88881f840000(0000) knlGS:0000000000000000
[55530.429447] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[55530.430645] CR2: 00007f97c4e00000 CR3: 00000007e7256004 CR4: 00000000003706e0
[55530.432007] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[55530.433360] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[55530.434711] Call Trace:
[55530.435541]  page_counter_uncharge+0x22/0x40
[55530.436571]  __mod_memcg_state.part.80+0x79/0xe0
[55530.437645]  __mod_memcg_lruvec_state+0x27/0x110
[55530.438712]  __mod_lruvec_state+0x39/0x40
[55530.439712]  unaccount_page_cache_page+0xd0/0x210
[55530.440803]  __delete_from_page_cache+0x3d/0x1d0
[55530.441877]  __remove_mapping+0xeb/0x220
[55530.442871]  remove_mapping+0x16/0x30
[55530.443836]  invalidate_inode_page+0x84/0x90
[55530.444869]  invalidate_mapping_pages+0x162/0x3e0
[55530.445957]  ? pick_next_task_fair+0x1f2/0x520
[55530.446996]  drop_pagecache_sb+0xac/0x130
[55530.447972]  iterate_supers+0xa2/0x110
[55530.448907]  ? do_coredump+0xb20/0xb20
[55530.449840]  drop_caches_sysctl_handler+0x5d/0x90
[55530.450893]  proc_sys_call_handler+0x1d0/0x290
[55530.451906]  proc_sys_write+0x14/0x20
[55530.452830]  __vfs_write+0x1b/0x40
[55530.453722]  vfs_write+0xab/0x1b0
[55530.454598]  ksys_write+0x61/0xe0
[55530.455471]  __x64_sys_write+0x1a/0x20
[55530.456392]  do_syscall_64+0x4d/0x120
[55530.457296]  entry_SYSCALL_64_after_hwframe+0x5c/0xc1
[55530.458346] RIP: 0033:0x7fe292836bc8

Fixes: a0d7d9851512 ("rue/mm: pagecache limit per cgroup support")
Signed-off-by: Xin Hao <vernhao@tencent.com>
2024-09-27 11:13:31 +08:00
Honglin Li b82ababba6 rue/mm: introduce new feature to async clean dying memcgs
When a memcg is removed, page caches and slab pages still
reference it, which can leave a very large number of dying
memcgs in the system. This feature can asynchronously clean
dying memcgs in the system.

1) sysctl -w vm.clean_dying_memcg_async=1
   #start a kthread to async clean dying memcgs, default
   #value is 0.

2) sysctl -w vm.clean_dying_memcg_threshold=10
   #Whenever 10 dying memcgs are generated in the system,
   #wakeup a kthread to async clean dying memcgs, default
   #value is 100.

Signed-off-by: Bin Lai <robinlai@tencent.com>
Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
2024-09-27 11:13:31 +08:00
Honglin Li 200560da23 rue/mm: introduce memcg page cache hit & miss ratio tool
A new memory.page_cache_hit control file is added
under each memory cgroup directory. Reading this file
prints the page cache hit and miss ratio at the memory
cgroup level.

Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
2024-09-27 11:13:31 +08:00
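Reading the new file might look like the sketch below; the cgroup name is hypothetical and the actual output format is not specified in the message, but the ratio itself is just hits over total lookups:

```shell
# Hypothetical cgroup name; reading the file prints the hit/miss ratio:
#   cat /sys/fs/cgroup/memory/mycg/memory.page_cache_hit
# The ratio is hits / (hits + misses); with made-up counters:
hits=900; misses=100
echo "hit ratio: $(( hits * 100 / (hits + misses) ))%"
```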
Honglin Li 8de07be077 rue/mm: introduce memory allocation latency for per-cgroup tool
A new memory.latency_histogram control file is added
under each memory cgroup directory. Reading this file prints
the memory access latency at the memory cgroup level.

Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
2024-09-27 11:13:31 +08:00
Honglin Li 1824581599 rue/mm: async free memory while process exiting
Introduce async free memory while process exiting
to shorten exit time.

Signed-off-by: Xiaoguang Chen <xiaoggchen@tencent.com>
Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
2024-09-27 11:13:31 +08:00
Honglin Li 75ad2bae3d rue/mm: pagecache limit per cgroup support
Functional test:
http://tapd.oa.com/TencentOS_QoS/prong/stories/view/
1020426664867405667?jump_count=1

Signed-off-by: Xiaoguang Chen <xiaoggchen@tencent.com>
Signed-off-by: Jingxiang Zeng <linuszeng@tencent.com>
Signed-off-by: Xuan Liu <benxliu@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
2024-09-27 11:13:31 +08:00
Honglin Li 56d80c4ea2 rue/mm: add memory cgroup async page reclaim mechanism
Introduce background page reclaim mechanism for memcg, it can
be configured according to the cgroup priorities for different
reclaim strategies.

Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
Signed-off-by: Mengmeng Chen <bauerchen@tencent.com>
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
2024-09-27 11:13:31 +08:00
Honglin Li 0d35c4c639 rue/mm: introduce memcg priority oom
Under memory pressure, reclaim and OOM can happen. With
multiple cgroups in one system, we might want some of their
memory or tasks to survive reclaim and OOM while there are
other candidates.

When OOM happens, a victim is always chosen from a low-priority
memcg. This works both for memcg OOM and global OOM. It can be
enabled/disabled through @memory.use_priority_oom (for global
OOM, through the root memcg's @memory.use_priority_oom) and is
disabled by default.

Signed-off-by: Haiwei Li <gerryhwli@tencent.com>
Signed-off-by: Mengmeng Chen <bauerchen@tencent.com>
Signed-off-by: Xiaoguang Chen <xiaoggchen@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
2024-09-27 11:13:31 +08:00
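A usage sketch for the knob described above; the cgroup v1 mount point and cgroup name are assumptions, only the `memory.use_priority_oom` file name comes from the message:

```shell
# Hypothetical paths: enable priority OOM for one memcg, and for
# global OOM via the root memcg (disabled by default).
echo 1 > /sys/fs/cgroup/memory/mycg/memory.use_priority_oom
echo 1 > /sys/fs/cgroup/memory/memory.use_priority_oom
```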
Honglin Li db44c11cdd rue/mm: add priority reclaim support
Introduce the sync && async priority reclaim mechanism.

Signed-off-by: Yu Liu <allanyuliu@tencent.com>
Signed-off-by: Xiaoguang Chen <xiaoggchen@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
2024-09-27 11:13:30 +08:00
Honglin Li 04f49a445c pagecachelimit: set an initial value for may_deactivate in shrink page cache
The global pagecache limit function fails due to a backported
upstream commit. In the scenario where the active file list
needs to be reclaimed, it cannot reclaim the LRU_ACTIVE_FILE
list, making the pagecache limit inaccurate.

When shrinking page cache, we set an initial value for
may_deactivate in scan_control to DEACTIVATE_FILE, allowing
the active file list to be scanned in shrink_list.

Signed-off-by: Honglin Li <honglinli@tencent.com>
Reviewed-by: Hongbo Li <herberthbli@tencent.com>
2024-09-27 11:13:30 +08:00
Honglin Li 26941c0f5e rue/net: avoid wrong memory access to struct net_device
The receiving process assigns the net_device pointer of the
network interface to sock->in_dev in cls_tc_rx_hook(). Using
the sock->in_dev pointer can lead to wrong memory access if
the memory of struct net_device is freed after the network
interface is unregistered, which may cause a kernel crash.

The above use after free issue causes a crash as follows:

BUG: unable to handle page fault for address: ffffffed698999c8
CPU: 50 PID: 1290732 Comm: kubelet Kdump: loaded
Tainted: G O K 5.4.119-1-tlinux4-0009.1 #1
RIP: 0010:cls_cgroup_tx_accept+0x5e/0x120
Call Trace:
 <IRQ>
 cls_tc_tx_hook+0x10d/0x1a0
 nf_hook_slow+0x43/0xc0
 __ip_local_out+0xcb/0x130
 ? ip_forward_options+0x190/0x190
 ip_local_out+0x1c/0x40
 __ip_queue_xmit+0x162/0x3d0
 ? rx_cgroup_throttle.isra.4+0x2b0/0x2b0
 ip_queue_xmit+0x10/0x20
 __tcp_transmit_skb+0x57f/0xbe0
 __tcp_retransmit_skb+0x1b0/0x8a0
 tcp_retransmit_skb+0x19/0xd0
 tcp_retransmit_timer+0x367/0xa80
 ? kvm_clock_get_cycles+0x11/0x20
 ? ktime_get+0x34/0x90
 tcp_write_timer_handler+0x93/0x1f0
 tcp_write_timer+0x7c/0x80
 ? tcp_write_timer_handler+0x1f0/0x1f0
 call_timer_fn+0x35/0x130
 run_timer_softirq+0x1a8/0x420
 ? ktime_get+0x34/0x90
 ? clockevents_program_event+0x85/0xe0
 __do_softirq+0x8c/0x2d7
 ? hrtimer_interrupt+0x12a/0x210
 irq_exit+0xa3/0xb0
 smp_apic_timer_interrupt+0x77/0x130
 apic_timer_interrupt+0xf/0x20
 </IRQ>

We introduce indev_ifindex as a new struct field to record
the ifindex of the net_device; indev_ifindex can then be
used to look the device up by index, avoiding direct memory
access to struct members through the in_dev pointer.

Fixes: f8829546f3b3 ("rue/net: init netcls traffic controller")
Signed-off-by: Honglin Li <honglinli@tencent.com>
Reviewed-by: Ze Gao <zegao@tencent.com>
2024-09-27 11:13:30 +08:00
Honglin Li 68a7910a16 rue/net: avoid wrong memory access to struct cgroup_cls_state
The memory of struct cgroup_cls_state may be freed
during the use of a pointer to the struct. This issue
can potentially lead to wrong memory access and thus
kernel crashes.

Increase the reference count of struct cgroup_cls_state
through css_tryget_online while the struct is in use.

The above causes a crash as follows:

CPU: 56 PID: 4161866 Comm: AppSourceDatapr Kdump: loaded
Tainted: G O 5.4.119-1-tlinux4-0008 #1
RIP: 0010:cls_cgroup_adjust_wnd+0x58/0x180
Call Trace:
 <IRQ>
 __tcp_transmit_skb+0x6a8/0xbe0
 __tcp_send_ack.part.50+0xc2/0x170
 tcp_send_ack+0x1c/0x20
 tcp_send_dupack+0x29/0x130
 ? kvm_clock_get_cycles+0x11/0x20
 tcp_validate_incoming+0x332/0x440
 tcp_rcv_established+0x1f6/0x670
 tcp_v4_do_rcv+0x18a/0x220
 tcp_v4_rcv+0xbfd/0xca0
 ip_protocol_deliver_rcu+0x1f/0x180
 ip_local_deliver_finish+0x51/0x60
 ip_local_deliver+0xcd/0xe0
 ? ip_protocol_deliver_rcu+0x180/0x180
 ip_rcv_finish+0x7b/0x90
 ip_rcv+0xb5/0xc0
 ? ip_rcv_finish_core.isra.18+0x380/0x380
 __netif_receive_skb_one_core+0x59/0x80
 __netif_receive_skb+0x26/0x70
 process_backlog+0xac/0x150
 net_rx_action+0x127/0x380
 ? ktime_get+0x34/0x90
 __do_softirq+0x8c/0x2d7
 irq_exit+0xa3/0xb0
 smp_call_function_single_interrupt+0x4c/0xd0
 call_function_single_interrupt+0xf/0x20
 </IRQ>

Signed-off-by: Honglin Li <honglinli@tencent.com>
Reviewed-by: Ze Gao <zegao@tencent.com>
Reviewed-by: Haisu Wang <haisuwang@tencent.com>
2024-09-27 11:13:30 +08:00
Honglin Li 55f6748cd1 rue/net: adapt to the new rue modular framework
Add to register and unregister rue net ops through
rue modular framework.

Signed-off-by: Honglin Li <honglinli@tencent.com>
Reviewed-by: Haisu Wang <haisuwang@tencent.com>
2024-09-27 11:13:30 +08:00
Honglin Li ca8edadc91 rue/net: add dynamic bandwidth allocation between online cgroups
Introduce netcls controller interface files, which can be
configured to enable/disable bandwidth allocation mechanism
among online net cgroups.

The mechanism realizes the migration of idle bandwidth resources
among online cgroups, while guaranteeing the minimum bandwidth
for per-cgroup, to improve resource utilization.

Signed-off-by: Honglin Li <honglinli@tencent.com>
Reviewed-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: Jason Xing <kernelxing@tencent.com>
2024-09-27 11:13:30 +08:00
Honglin Li 3811ff7c02 rue/net: add netdev-based rate limit for per cgroup
Introduce netdev-based rate limit for rx && tx direction.

Signed-off-by: Honglin Li <honglinli@tencent.com>
Reviewed-by: Zhiping Du <zhipingdu@tencent.com>
2024-09-27 11:13:30 +08:00
Honglin Li 9a447c5cb9 rue/net: add total bandwidth limit for multiprio preemption
Introduce the total bandwidth limit mechanism for rx && tx direction.

Signed-off-by: Honglin Li <honglinli@tencent.com>
Signed-off-by: Zhiping Du <zhipingdu@tencent.com>
2024-09-27 11:13:30 +08:00
Honglin Li 703664bf47 rue/net: add support for cgroup whitelist ports
Introduce the cgroup whitelist ports mechanism.

Signed-off-by: Honglin Li <honglinli@tencent.com>
Signed-off-by: Zhiping Du <zhipingdu@tencent.com>
2024-09-27 11:13:30 +08:00
Honglin Li ca0f6ddd21 rue/net: add rx && tx rate limit for per cgroup
Introduce the bandwidth rate limit mechanism for per cgroup.

Signed-off-by: Honglin Li <honglinli@tencent.com>
Signed-off-by: Zhiping Du <zhipingdu@tencent.com>
2024-09-27 11:13:29 +08:00
Honglin Li 669bbf19cd rue/net: init netcls traffic controller
Add multiprio dynamic bandwidth controller.

Signed-off-by: Honglin Li <honglinli@tencent.com>
Signed-off-by: Hongbo Li <herberthbli@tencent.com>
Signed-off-by: Zhiping Du <zhipingdu@tencent.com>
2024-09-27 11:13:29 +08:00
Haisu Wang 0f93976785 rue: Revert "kallsyms: unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()"
Export the two functions again for modules like RUE.

This reverts commit 0bd476e6c6.

Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Signed-off-by: Honglin Li <honglinli@tencent.com>
2024-09-27 11:13:29 +08:00
Ze Gao d5a175186d rue: Add support for rue modularization
Add framework support to enable rue to be installed as
a separate module.

In order to safely insmod/rmmod, we use per-cpu counter to
track how many rue related functions are on the fly, and
it's only safe to insmod/rmmod when there's no tasks using
any of these functions registered by rue module.

Signed-off-by: Ze Gao <zegao@tencent.com>
2024-09-27 11:13:29 +08:00
Hongbo Li 5dc70a633d rue: init rue module
Add the init code of rue module.
Support both the built-in and module (default) build.

Signed-off-by: Hongbo Li <herberthbli@tencent.com>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: Honglin Li <honglinli@tencent.com>
2024-09-27 11:13:29 +08:00
Hongbo Li fce3609ebf rue: cgroup priority
Add cgroup priority.

Signed-off-by: Hongbo Li <herberthbli@tencent.com>
Signed-off-by: Lei Chen  <lennychen@tencent.com>
Signed-off-by: Yu Liu    <allanyuliu@tencent.com>
2024-09-27 11:13:29 +08:00
Haisu Wang 61bf5b5b7e blkcg/diskstats: Fix the extra cpu parameter
Upstream: no

Commit 6dfa517032 unified blkcg_part_stat_add() without
implicitly passing the cpu number. Also, CONFIG_BLK_CGROUP_DISKSTATS
depends on CONFIG_BLK_CGROUP, so there is no need to define
blkcg_part_stat_add() when CONFIG_BLK_CGROUP is disabled.

Correct the error message "implicit declaration of function
‘blkcg_dkstats_show_comm’" when CONFIG_BLK_CGROUP_DISKSTATS is disabled.

Fixes: 6dfa517032 ("blkcg/diskstats: add per blkcg diskstats support")
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
2024-09-27 11:13:29 +08:00
Haisu Wang 2fc4b0e9c0 mm: set default watermark_boost_factor value to 0
Upstream: no

Watermark boost factor controls the level of reclaim when memory is
being fragmented. The intent is that compaction has less work to do in the
future and to increase the success rate of future high-order allocations
such as SLUB allocations, THP and hugetlbfs pages.
However, it wakes up kswapd to do defragmentation, an action that
caused performance jitter in many cases without enough gain.

Some distributions, like Debian, also set the default boost
factor to 0 to disable the feature.

WXG Story of compaction cause performance jitter:
https://doc.weixin.qq.com/doc/w3_AIAAcwacAAYudo6ERcUQMiNUbmvzb?scode=AJEAIQdfAAoeO7AbqSAYQATQaYAJg

Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
Reviewed-by:  Jianping Liu <frankjpliu@tencent.com>
2024-09-27 11:13:28 +08:00
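The knob changed above is the upstream `vm.watermark_boost_factor` sysctl; the patch makes 0 the built-in default, and the same effect can be had at runtime:

```shell
# Disable watermark boosting (equivalent to the new default of 0).
sysctl -w vm.watermark_boost_factor=0
```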
Haisu Wang b03afc0d33 Revert "io/tqos: merge buffer io limit series patch from brookxu, and rework some function."
This reverts commit 538ec11bed.

Revert due to refactoring of the buffer IO function.
In TK5 it is unnecessary to keep KABI compatibility by using
the "nodeinfo" field in "struct mem_cgroup {}".

Original tapd and MR:
  https://tapd.woa.com/tapd_fe/20422414/story/detail/1020422414117471502
  https://git.woa.com/tlinux/tkernel5/-/merge_requests/117

Signed-off-by: Haisu Wang <haisuwang@tencent.com>
2024-09-27 11:13:24 +08:00
Haisu Wang 3231efb956 Revert "io/tqos: add sysctl_buffer_io_limit switch for buffer io limit."
This reverts commit 4d87de6bb4.

Revert due to refactoring of the buffer IO function.
In TK5 it is unnecessary to keep KABI compatibility by using
the "nodeinfo" field in "struct mem_cgroup {}".

Original tapd and MR:
  https://tapd.woa.com/tapd_fe/20422414/story/detail/1020422414117471502
  https://git.woa.com/tlinux/tkernel5/-/merge_requests/117

Signed-off-by: Haisu Wang <haisuwang@tencent.com>
2024-09-27 11:13:21 +08:00
Haisu Wang 24cfc0a666 Revert "cgroup: allow cgroup to split direct io and buffered io into different blkio cgroup"
This reverts commit 71aaa09350.

Revert due to refactoring of the buffer IO function.
In TK5 it is unnecessary to keep KABI compatibility by using
the "nodeinfo" field in "struct mem_cgroup {}".

Original tapd and MR:
  https://tapd.woa.com/tapd_fe/20422414/story/detail/1020422414117471502
  https://git.woa.com/tlinux/tkernel5/-/merge_requests/117

Signed-off-by: Haisu Wang <haisuwang@tencent.com>
2024-09-27 11:12:41 +08:00
aurelianliu b1e1aed588 config,x86: open edr
Enable DPC and EDR, which enable the PCIe eDPC function.
When a UCE occurs, eDPC can reset the device link and resume
the device, similar to hotplugging the device.

Signed-off-by: Aurelianliu <aurelianliu@tencent.com>
2024-09-11 02:06:12 +00:00
Daniel Maslowski e16058ed64 riscv/purgatory: align riscv_kernel_entry
Fix CVE: CVE-2024-43868

[ Upstream commit fb197c5d2fd24b9af3d4697d0cf778645846d6d5 ]

When alignment handling is delegated to the kernel, everything must be
word-aligned in purgatory, since the trap handler is then set to the
kexec one. Without the alignment, hitting the exception would
ultimately crash. On other occasions, the kernel's handler would take
care of exceptions.
This has been tested on a JH7110 SoC with oreboot and its SBI delegating
unaligned access exceptions and the kernel configured to handle them.

Fixes: 736e30af58 ("RISC-V: Add purgatory")
Signed-off-by: Daniel Maslowski <cyrevolt@gmail.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240719170437.247457-1-cyrevolt@gmail.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
2024-09-10 19:36:42 +08:00
Jianping Liu 7a6899b55a config,x86: disable CONFIG_IOMMU_DEBUGFS
To avoid the log like below:
[    0.095948] *************************************************************
[    0.095948] **     NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE    **
[    0.096220] **                                                         **
[    0.096221] **  IOMMU DebugFS SUPPORT HAS BEEN ENABLED IN THIS KERNEL  **
[    0.096222] **                                                         **
[    0.096223] ** This means that this kernel is built to expose internal **
[    0.096224] ** IOMMU data structures, which may compromise security on **
[    0.096225] ** your system.                                            **
[    0.096227] **                                                         **
[    0.096227] ** If you see this message and you are not debugging the   **
[    0.096228] ** kernel, report this immediately to your vendor!         **
[    0.096229] **                                                         **
[    0.096230] **     NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE    **
[    0.096231] *************************************************************
disable CONFIG_IOMMU_DEBUGFS.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-09-06 15:03:58 +08:00
Jianping Liu 64a21c8a25 hung_task,watchdog: set thresh time to 600 seconds
When CONFIG_KASAN is enabled, the kernel runs much slower; set
the hung_task and soft lockup thresh times to 600 seconds.

Signed-off-by: Jianping Liu <frankjpliu@tencent.com>
Reviewed-by: Yongliang Gao <leonylgao@tencent.com>
2024-09-05 15:24:07 +08:00
Jianping Liu 2748b6ef40 Merge OCK next branch to TK5 master branch 2024-09-03 11:26:15 +08:00