2019-06-01 16:08:42 +08:00
|
|
|
// SPDX-License-Identifier: GPL-2.0-only
|
2009-02-20 15:29:08 +08:00
|
|
|
/*
|
2010-04-09 17:57:01 +08:00
|
|
|
* mm/percpu.c - percpu memory allocator
|
2009-02-20 15:29:08 +08:00
|
|
|
*
|
|
|
|
* Copyright (C) 2009 SUSE Linux Products GmbH
|
|
|
|
* Copyright (C) 2009 Tejun Heo <tj@kernel.org>
|
|
|
|
*
|
2017-07-25 07:02:20 +08:00
|
|
|
* Copyright (C) 2017 Facebook Inc.
|
2020-04-02 01:07:48 +08:00
|
|
|
* Copyright (C) 2017 Dennis Zhou <dennis@kernel.org>
|
2017-07-25 07:02:20 +08:00
|
|
|
*
|
2017-07-16 10:23:09 +08:00
|
|
|
* The percpu allocator handles both static and dynamic areas. Percpu
|
|
|
|
* areas are allocated in chunks which are divided into units. There is
|
|
|
|
* a 1-to-1 mapping for units to possible cpus. These units are grouped
|
|
|
|
* based on NUMA properties of the machine.
|
2009-02-20 15:29:08 +08:00
|
|
|
*
|
|
|
|
* c0 c1 c2
|
|
|
|
* ------------------- ------------------- ------------
|
|
|
|
* | u0 | u1 | u2 | u3 | | u0 | u1 | u2 | u3 | | u0 | u1 | u
|
|
|
|
* ------------------- ...... ------------------- .... ------------
|
|
|
|
*
|
2017-07-16 10:23:09 +08:00
|
|
|
* Allocation is done by offsets into a unit's address space. Ie., an
|
|
|
|
* area of 512 bytes at 6k in c1 occupies 512 bytes at 6k in c1:u0,
|
|
|
|
* c1:u1, c1:u2, etc. On NUMA machines, the mapping may be non-linear
|
|
|
|
* and even sparse. Access is handled by configuring percpu base
|
|
|
|
* registers according to the cpu to unit mappings and offsetting the
|
|
|
|
* base address using pcpu_unit_size.
|
|
|
|
*
|
|
|
|
* There is special consideration for the first chunk which must handle
|
|
|
|
* the static percpu variables in the kernel image as allocation services
|
2017-07-25 07:02:20 +08:00
|
|
|
* are not online yet. In short, the first chunk is structured like so:
|
2017-07-16 10:23:09 +08:00
|
|
|
*
|
|
|
|
* <Static | [Reserved] | Dynamic>
|
|
|
|
*
|
|
|
|
* The static data is copied from the original section managed by the
|
|
|
|
* linker. The reserved section, if non-zero, primarily manages static
|
|
|
|
* percpu variables from kernel modules. Finally, the dynamic section
|
|
|
|
* takes care of normal allocations.
|
2009-02-20 15:29:08 +08:00
|
|
|
*
|
2017-07-25 07:02:20 +08:00
|
|
|
* The allocator organizes chunks into lists according to free size and
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
* memcg-awareness. To make a percpu allocation memcg-aware the __GFP_ACCOUNT
|
|
|
|
* flag should be passed. All memcg-aware allocations are sharing one set
|
|
|
|
* of chunks and all unaccounted allocations and allocations performed
|
|
|
|
* by processes belonging to the root memory cgroup are using the second set.
|
|
|
|
*
|
|
|
|
* The allocator tries to allocate from the fullest chunk first. Each chunk
|
|
|
|
* is managed by a bitmap with metadata blocks. The allocation map is updated
|
|
|
|
* on every allocation and free to reflect the current state while the boundary
|
2017-07-25 07:02:20 +08:00
|
|
|
* map is only updated on allocation. Each metadata block contains
|
|
|
|
* information to help mitigate the need to iterate over large portions
|
|
|
|
* of the bitmap. The reverse mapping from page to chunk is stored in
|
|
|
|
* the page's index. Lastly, units are lazily backed and grow in unison.
|
|
|
|
*
|
|
|
|
* There is a unique conversion that goes on here between bytes and bits.
|
|
|
|
* Each bit represents a fragment of size PCPU_MIN_ALLOC_SIZE. The chunk
|
|
|
|
* tracks the number of pages it is responsible for in nr_pages. Helper
|
|
|
|
* functions are used to convert from between the bytes, bits, and blocks.
|
|
|
|
* All hints are managed in bits unless explicitly stated.
|
2017-07-16 10:23:09 +08:00
|
|
|
*
|
2017-02-28 06:29:56 +08:00
|
|
|
* To use this allocator, arch code should do the following:
|
2009-02-20 15:29:08 +08:00
|
|
|
*
|
|
|
|
* - define __addr_to_pcpu_ptr() and __pcpu_ptr_to_addr() to translate
|
2009-03-10 15:27:48 +08:00
|
|
|
* regular address to percpu pointer and back if they need to be
|
|
|
|
* different from the default
|
2009-02-20 15:29:08 +08:00
|
|
|
*
|
2009-02-24 10:57:21 +08:00
|
|
|
* - use pcpu_setup_first_chunk() during percpu area initialization to
|
|
|
|
* setup the first chunk containing the kernel static percpu area
|
2009-02-20 15:29:08 +08:00
|
|
|
*/
|
|
|
|
|
2016-03-18 05:19:53 +08:00
|
|
|
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
|
|
|
|
|
2009-02-20 15:29:08 +08:00
|
|
|
#include <linux/bitmap.h>
|
2020-10-30 09:38:20 +08:00
|
|
|
#include <linux/cpumask.h>
|
2018-10-31 06:09:49 +08:00
|
|
|
#include <linux/memblock.h>
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
#include <linux/err.h>
|
2017-07-25 07:02:12 +08:00
|
|
|
#include <linux/lcm.h>
|
2009-02-20 15:29:08 +08:00
|
|
|
#include <linux/list.h>
|
2009-07-04 07:11:00 +08:00
|
|
|
#include <linux/log2.h>
|
2009-02-20 15:29:08 +08:00
|
|
|
#include <linux/mm.h>
|
|
|
|
#include <linux/module.h>
|
|
|
|
#include <linux/mutex.h>
|
|
|
|
#include <linux/percpu.h>
|
|
|
|
#include <linux/pfn.h>
|
|
|
|
#include <linux/slab.h>
|
2009-03-06 23:44:13 +08:00
|
|
|
#include <linux/spinlock.h>
|
2009-02-20 15:29:08 +08:00
|
|
|
#include <linux/vmalloc.h>
|
2009-03-06 23:44:11 +08:00
|
|
|
#include <linux/workqueue.h>
|
2011-09-27 00:12:53 +08:00
|
|
|
#include <linux/kmemleak.h>
|
2018-03-14 23:27:26 +08:00
|
|
|
#include <linux/sched.h>
|
percpu: make pcpu_alloc() aware of current gfp context
Since 5.7-rc1, on btrfs we have a percpu counter initialization for
which we always pass a GFP_KERNEL gfp_t argument (this happens since
commit 2992df73268f78 ("btrfs: Implement DREW lock")).
That is safe in some contextes but not on others where allowing fs
reclaim could lead to a deadlock because we are either holding some
btrfs lock needed for a transaction commit or holding a btrfs
transaction handle open. Because of that we surround the call to the
function that initializes the percpu counter with a NOFS context using
memalloc_nofs_save() (this is done at btrfs_init_fs_root()).
However it turns out that this is not enough to prevent a possible
deadlock because percpu_alloc() determines if it is in an atomic context
by looking exclusively at the gfp flags passed to it (GFP_KERNEL in this
case) and it is not aware that a NOFS context is set.
Because percpu_alloc() thinks it is in a non atomic context it locks the
pcpu_alloc_mutex. This can result in a btrfs deadlock when
pcpu_balance_workfn() is running, has acquired that mutex and is waiting
for reclaim, while the btrfs task that called percpu_counter_init() (and
therefore percpu_alloc()) is holding either the btrfs commit_root
semaphore or a transaction handle (done fs/btrfs/backref.c:
iterate_extent_inodes()), which prevents reclaim from finishing as an
attempt to commit the current btrfs transaction will deadlock.
Lockdep reports this issue with the following trace:
======================================================
WARNING: possible circular locking dependency detected
5.6.0-rc7-btrfs-next-77 #1 Not tainted
------------------------------------------------------
kswapd0/91 is trying to acquire lock:
ffff8938a3b3fdc8 (&delayed_node->mutex){+.+.}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
but task is already holding lock:
ffffffffb4f0dbc0 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (fs_reclaim){+.+.}:
fs_reclaim_acquire.part.0+0x25/0x30
__kmalloc+0x5f/0x3a0
pcpu_create_chunk+0x19/0x230
pcpu_balance_workfn+0x56a/0x680
process_one_work+0x235/0x5f0
worker_thread+0x50/0x3b0
kthread+0x120/0x140
ret_from_fork+0x3a/0x50
-> #3 (pcpu_alloc_mutex){+.+.}:
__mutex_lock+0xa9/0xaf0
pcpu_alloc+0x480/0x7c0
__percpu_counter_init+0x50/0xd0
btrfs_drew_lock_init+0x22/0x70 [btrfs]
btrfs_get_fs_root+0x29c/0x5c0 [btrfs]
resolve_indirect_refs+0x120/0xa30 [btrfs]
find_parent_nodes+0x50b/0xf30 [btrfs]
btrfs_find_all_leafs+0x60/0xb0 [btrfs]
iterate_extent_inodes+0x139/0x2f0 [btrfs]
iterate_inodes_from_logical+0xa1/0xe0 [btrfs]
btrfs_ioctl_logical_to_ino+0xb4/0x190 [btrfs]
btrfs_ioctl+0x165a/0x3130 [btrfs]
ksys_ioctl+0x87/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x5c/0x260
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #2 (&fs_info->commit_root_sem){++++}:
down_write+0x38/0x70
btrfs_cache_block_group+0x2ec/0x500 [btrfs]
find_free_extent+0xc6a/0x1600 [btrfs]
btrfs_reserve_extent+0x9b/0x180 [btrfs]
btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
__btrfs_cow_block+0x122/0x5a0 [btrfs]
btrfs_cow_block+0x106/0x240 [btrfs]
commit_cowonly_roots+0x55/0x310 [btrfs]
btrfs_commit_transaction+0x509/0xb20 [btrfs]
sync_filesystem+0x74/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x93/0xc0
exit_to_usermode_loop+0xf9/0x100
do_syscall_64+0x20d/0x260
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #1 (&space_info->groups_sem){++++}:
down_read+0x3c/0x140
find_free_extent+0xef6/0x1600 [btrfs]
btrfs_reserve_extent+0x9b/0x180 [btrfs]
btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
__btrfs_cow_block+0x122/0x5a0 [btrfs]
btrfs_cow_block+0x106/0x240 [btrfs]
btrfs_search_slot+0x50c/0xd60 [btrfs]
btrfs_lookup_inode+0x3a/0xc0 [btrfs]
__btrfs_update_delayed_inode+0x90/0x280 [btrfs]
__btrfs_commit_inode_delayed_items+0x81f/0x870 [btrfs]
__btrfs_run_delayed_items+0x8e/0x180 [btrfs]
btrfs_commit_transaction+0x31b/0xb20 [btrfs]
iterate_supers+0x87/0xf0
ksys_sync+0x60/0xb0
__ia32_sys_sync+0xa/0x10
do_syscall_64+0x5c/0x260
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #0 (&delayed_node->mutex){+.+.}:
__lock_acquire+0xef0/0x1c80
lock_acquire+0xa2/0x1d0
__mutex_lock+0xa9/0xaf0
__btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
btrfs_evict_inode+0x40d/0x560 [btrfs]
evict+0xd9/0x1c0
dispose_list+0x48/0x70
prune_icache_sb+0x54/0x80
super_cache_scan+0x124/0x1a0
do_shrink_slab+0x176/0x440
shrink_slab+0x23a/0x2c0
shrink_node+0x188/0x6e0
balance_pgdat+0x31d/0x7f0
kswapd+0x238/0x550
kthread+0x120/0x140
ret_from_fork+0x3a/0x50
other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> pcpu_alloc_mutex --> fs_reclaim
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(fs_reclaim);
lock(pcpu_alloc_mutex);
lock(fs_reclaim);
lock(&delayed_node->mutex);
*** DEADLOCK ***
3 locks held by kswapd0/91:
#0: (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
#1: (shrinker_rwsem){++++}, at: shrink_slab+0x12f/0x2c0
#2: (&type->s_umount_key#43){++++}, at: trylock_super+0x16/0x50
stack backtrace:
CPU: 1 PID: 91 Comm: kswapd0 Not tainted 5.6.0-rc7-btrfs-next-77 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8f/0xd0
check_noncircular+0x170/0x190
__lock_acquire+0xef0/0x1c80
lock_acquire+0xa2/0x1d0
__mutex_lock+0xa9/0xaf0
__btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
btrfs_evict_inode+0x40d/0x560 [btrfs]
evict+0xd9/0x1c0
dispose_list+0x48/0x70
prune_icache_sb+0x54/0x80
super_cache_scan+0x124/0x1a0
do_shrink_slab+0x176/0x440
shrink_slab+0x23a/0x2c0
shrink_node+0x188/0x6e0
balance_pgdat+0x31d/0x7f0
kswapd+0x238/0x550
kthread+0x120/0x140
ret_from_fork+0x3a/0x50
This could be fixed by making btrfs pass GFP_NOFS instead of GFP_KERNEL
to percpu_counter_init() in contextes where it is not reclaim safe,
however that type of approach is discouraged since
memalloc_[nofs|noio]_save() were introduced. Therefore this change
makes pcpu_alloc() look up into an existing nofs/noio context before
deciding whether it is in an atomic context or not.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Link: http://lkml.kernel.org/r/20200430164356.15543-1-fdmanana@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-08 09:36:10 +08:00
|
|
|
#include <linux/sched/mm.h>
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
#include <linux/memcontrol.h>
|
2009-02-20 15:29:08 +08:00
|
|
|
|
|
|
|
#include <asm/cacheflush.h>
|
2009-03-10 15:27:48 +08:00
|
|
|
#include <asm/sections.h>
|
2009-02-20 15:29:08 +08:00
|
|
|
#include <asm/tlbflush.h>
|
2009-11-24 14:50:03 +08:00
|
|
|
#include <asm/io.h>
|
2009-02-20 15:29:08 +08:00
|
|
|
|
2017-06-20 07:28:32 +08:00
|
|
|
#define CREATE_TRACE_POINTS
|
|
|
|
#include <trace/events/percpu.h>
|
|
|
|
|
2017-06-20 07:28:30 +08:00
|
|
|
#include "percpu-internal.h"
|
|
|
|
|
2021-04-08 11:57:31 +08:00
|
|
|
/*
|
|
|
|
* The slots are sorted by the size of the biggest continuous free area.
|
|
|
|
* 1-31 bytes share the same slot.
|
|
|
|
*/
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
#define PCPU_SLOT_BASE_SHIFT 5
|
2019-02-26 01:03:50 +08:00
|
|
|
/* chunks in slots below this are subject to being sidelined on failed alloc */
|
|
|
|
#define PCPU_SLOT_FAIL_THRESHOLD 3
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
|
2014-09-03 02:46:05 +08:00
|
|
|
#define PCPU_EMPTY_POP_PAGES_LOW 2
|
|
|
|
#define PCPU_EMPTY_POP_PAGES_HIGH 4
|
2009-02-20 15:29:08 +08:00
|
|
|
|
2010-09-04 00:22:48 +08:00
|
|
|
#ifdef CONFIG_SMP
|
2009-03-10 15:27:48 +08:00
|
|
|
/* default addr <-> pcpu_ptr mapping, override in asm/percpu.h if necessary */
|
|
|
|
#ifndef __addr_to_pcpu_ptr
|
|
|
|
#define __addr_to_pcpu_ptr(addr) \
|
2010-02-02 13:38:57 +08:00
|
|
|
(void __percpu *)((unsigned long)(addr) - \
|
|
|
|
(unsigned long)pcpu_base_addr + \
|
|
|
|
(unsigned long)__per_cpu_start)
|
2009-03-10 15:27:48 +08:00
|
|
|
#endif
|
|
|
|
#ifndef __pcpu_ptr_to_addr
|
|
|
|
#define __pcpu_ptr_to_addr(ptr) \
|
2010-02-02 13:38:57 +08:00
|
|
|
(void __force *)((unsigned long)(ptr) + \
|
|
|
|
(unsigned long)pcpu_base_addr - \
|
|
|
|
(unsigned long)__per_cpu_start)
|
2009-03-10 15:27:48 +08:00
|
|
|
#endif
|
2010-09-04 00:22:48 +08:00
|
|
|
#else /* CONFIG_SMP */
|
|
|
|
/* on UP, it's always identity mapped */
|
|
|
|
#define __addr_to_pcpu_ptr(addr) (void __percpu *)(addr)
|
|
|
|
#define __pcpu_ptr_to_addr(ptr) (void __force *)(ptr)
|
|
|
|
#endif /* CONFIG_SMP */
|
2009-03-10 15:27:48 +08:00
|
|
|
|
2017-05-11 01:36:37 +08:00
|
|
|
static int pcpu_unit_pages __ro_after_init;
|
|
|
|
static int pcpu_unit_size __ro_after_init;
|
|
|
|
static int pcpu_nr_units __ro_after_init;
|
|
|
|
static int pcpu_atom_size __ro_after_init;
|
2017-06-20 07:28:30 +08:00
|
|
|
int pcpu_nr_slots __ro_after_init;
|
2021-05-14 14:39:52 +08:00
|
|
|
static int pcpu_free_slot __ro_after_init;
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
int pcpu_sidelined_slot __ro_after_init;
|
|
|
|
int pcpu_to_depopulate_slot __ro_after_init;
|
2017-05-11 01:36:37 +08:00
|
|
|
static size_t pcpu_chunk_struct_size __ro_after_init;
|
2009-02-20 15:29:08 +08:00
|
|
|
|
2011-11-19 02:55:35 +08:00
|
|
|
/* cpus with the lowest and highest unit addresses */
|
2017-05-11 01:36:37 +08:00
|
|
|
static unsigned int pcpu_low_unit_cpu __ro_after_init;
|
|
|
|
static unsigned int pcpu_high_unit_cpu __ro_after_init;
|
2009-07-04 07:11:00 +08:00
|
|
|
|
2009-02-20 15:29:08 +08:00
|
|
|
/* the address of the first chunk which starts with the kernel static area */
|
2017-05-11 01:36:37 +08:00
|
|
|
void *pcpu_base_addr __ro_after_init;
|
2009-02-20 15:29:08 +08:00
|
|
|
|
2017-05-11 01:36:37 +08:00
|
|
|
static const int *pcpu_unit_map __ro_after_init; /* cpu -> unit */
|
|
|
|
const unsigned long *pcpu_unit_offsets __ro_after_init; /* cpu -> unit offset */
|
2009-07-04 07:11:00 +08:00
|
|
|
|
2009-08-14 14:00:52 +08:00
|
|
|
/* group information, used for vm allocation */
|
2017-05-11 01:36:37 +08:00
|
|
|
static int pcpu_nr_groups __ro_after_init;
|
|
|
|
static const unsigned long *pcpu_group_offsets __ro_after_init;
|
|
|
|
static const size_t *pcpu_group_sizes __ro_after_init;
|
2009-08-14 14:00:52 +08:00
|
|
|
|
2009-04-02 12:19:54 +08:00
|
|
|
/*
|
|
|
|
* The first chunk which always exists. Note that unlike other
|
|
|
|
* chunks, this one can be allocated and mapped in several different
|
|
|
|
* ways and thus often doesn't live in the vmalloc area.
|
|
|
|
*/
|
2017-06-20 07:28:30 +08:00
|
|
|
struct pcpu_chunk *pcpu_first_chunk __ro_after_init;
|
2009-04-02 12:19:54 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Optional reserved chunk. This chunk reserves part of the first
|
2017-07-25 07:01:59 +08:00
|
|
|
* chunk and serves it for reserved allocations. When the reserved
|
|
|
|
* region doesn't exist, the following variable is NULL.
|
2009-04-02 12:19:54 +08:00
|
|
|
*/
|
2017-06-20 07:28:30 +08:00
|
|
|
struct pcpu_chunk *pcpu_reserved_chunk __ro_after_init;
|
2009-03-06 13:33:59 +08:00
|
|
|
|
2017-06-20 07:28:30 +08:00
|
|
|
DEFINE_SPINLOCK(pcpu_lock); /* all internal data structures */
|
2016-05-25 23:48:25 +08:00
|
|
|
static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop, map ext */
|
2009-02-20 15:29:08 +08:00
|
|
|
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
struct list_head *pcpu_chunk_lists __ro_after_init; /* chunk list slots */
|
2009-02-20 15:29:08 +08:00
|
|
|
|
2016-05-25 23:48:25 +08:00
|
|
|
/* chunks which need their map areas extended, protected by pcpu_lock */
|
|
|
|
static LIST_HEAD(pcpu_map_extend_chunks);
|
|
|
|
|
2014-09-03 02:46:05 +08:00
|
|
|
/*
|
2021-06-03 09:09:31 +08:00
|
|
|
* The number of empty populated pages, protected by pcpu_lock.
|
2021-04-08 11:57:33 +08:00
|
|
|
* The reserved chunk doesn't contribute to the count.
|
2014-09-03 02:46:05 +08:00
|
|
|
*/
|
2021-06-03 09:09:31 +08:00
|
|
|
int pcpu_nr_empty_pop_pages;
|
2014-09-03 02:46:05 +08:00
|
|
|
|
2018-08-22 12:53:58 +08:00
|
|
|
/*
|
|
|
|
* The number of populated pages in use by the allocator, protected by
|
|
|
|
* pcpu_lock. This number is kept per a unit per chunk (i.e. when a page gets
|
|
|
|
* allocated/deallocated, it is allocated/deallocated in all units of a chunk
|
|
|
|
* and increments/decrements this count by 1).
|
|
|
|
*/
|
|
|
|
static unsigned long pcpu_nr_populated;
|
|
|
|
|
2014-09-03 02:46:05 +08:00
|
|
|
/*
|
|
|
|
* Balance work is used to populate or destroy chunks asynchronously. We
|
|
|
|
* try to keep the number of populated free pages between
|
|
|
|
* PCPU_EMPTY_POP_PAGES_LOW and HIGH for atomic allocations and at most one
|
|
|
|
* empty chunk.
|
|
|
|
*/
|
2014-09-03 02:46:05 +08:00
|
|
|
static void pcpu_balance_workfn(struct work_struct *work);
|
|
|
|
static DECLARE_WORK(pcpu_balance_work, pcpu_balance_workfn);
|
2014-09-03 02:46:05 +08:00
|
|
|
static bool pcpu_async_enabled __read_mostly;
|
|
|
|
static bool pcpu_atomic_alloc_failed;
|
|
|
|
|
|
|
|
static void pcpu_schedule_balance_work(void)
|
|
|
|
{
|
|
|
|
if (pcpu_async_enabled)
|
|
|
|
schedule_work(&pcpu_balance_work);
|
|
|
|
}
|
2009-03-06 23:44:11 +08:00
|
|
|
|
2017-07-25 07:02:05 +08:00
|
|
|
/**
|
2017-07-25 07:02:06 +08:00
|
|
|
* pcpu_addr_in_chunk - check if the address is served from this chunk
|
|
|
|
* @chunk: chunk of interest
|
|
|
|
* @addr: percpu address
|
2017-07-25 07:02:05 +08:00
|
|
|
*
|
|
|
|
* RETURNS:
|
2017-07-25 07:02:06 +08:00
|
|
|
* True if the address is served from this chunk.
|
2017-07-25 07:02:05 +08:00
|
|
|
*/
|
2017-07-25 07:02:06 +08:00
|
|
|
static bool pcpu_addr_in_chunk(struct pcpu_chunk *chunk, void *addr)
|
2010-04-09 17:57:00 +08:00
|
|
|
{
|
2017-07-25 07:02:05 +08:00
|
|
|
void *start_addr, *end_addr;
|
|
|
|
|
2017-07-25 07:02:06 +08:00
|
|
|
if (!chunk)
|
2017-07-25 07:02:05 +08:00
|
|
|
return false;
|
2010-04-09 17:57:00 +08:00
|
|
|
|
2017-07-25 07:02:06 +08:00
|
|
|
start_addr = chunk->base_addr + chunk->start_offset;
|
|
|
|
end_addr = chunk->base_addr + chunk->nr_pages * PAGE_SIZE -
|
|
|
|
chunk->end_offset;
|
2017-07-25 07:02:05 +08:00
|
|
|
|
|
|
|
return addr >= start_addr && addr < end_addr;
|
2010-04-09 17:57:00 +08:00
|
|
|
}
|
|
|
|
|
2009-02-24 10:57:21 +08:00
|
|
|
static int __pcpu_size_to_slot(int size)
|
2009-02-20 15:29:08 +08:00
|
|
|
{
|
2009-02-21 15:56:23 +08:00
|
|
|
int highbit = fls(size); /* size is in bytes */
|
2009-02-20 15:29:08 +08:00
|
|
|
return max(highbit - PCPU_SLOT_BASE_SHIFT + 2, 1);
|
|
|
|
}
|
|
|
|
|
2009-02-24 10:57:21 +08:00
|
|
|
static int pcpu_size_to_slot(int size)
|
|
|
|
{
|
|
|
|
if (size == pcpu_unit_size)
|
2021-04-19 06:44:16 +08:00
|
|
|
return pcpu_free_slot;
|
2009-02-24 10:57:21 +08:00
|
|
|
return __pcpu_size_to_slot(size);
|
|
|
|
}
|
|
|
|
|
2009-02-20 15:29:08 +08:00
|
|
|
static int pcpu_chunk_slot(const struct pcpu_chunk *chunk)
|
|
|
|
{
|
2019-02-27 02:00:08 +08:00
|
|
|
const struct pcpu_block_md *chunk_md = &chunk->chunk_md;
|
|
|
|
|
|
|
|
if (chunk->free_bytes < PCPU_MIN_ALLOC_SIZE ||
|
|
|
|
chunk_md->contig_hint == 0)
|
2009-02-20 15:29:08 +08:00
|
|
|
return 0;
|
|
|
|
|
2019-02-27 02:00:08 +08:00
|
|
|
return pcpu_size_to_slot(chunk_md->contig_hint * PCPU_MIN_ALLOC_SIZE);
|
2009-02-20 15:29:08 +08:00
|
|
|
}
|
|
|
|
|
2010-04-09 17:57:01 +08:00
|
|
|
/* set the pointer to a chunk in a page struct */
|
|
|
|
static void pcpu_set_page_chunk(struct page *page, struct pcpu_chunk *pcpu)
|
|
|
|
{
|
|
|
|
page->index = (unsigned long)pcpu;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* obtain pointer to a chunk from a page struct */
|
|
|
|
static struct pcpu_chunk *pcpu_get_page_chunk(struct page *page)
|
|
|
|
{
|
|
|
|
return (struct pcpu_chunk *)page->index;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int __maybe_unused pcpu_page_idx(unsigned int cpu, int page_idx)
|
2009-02-20 15:29:08 +08:00
|
|
|
{
|
2009-07-04 07:11:00 +08:00
|
|
|
return pcpu_unit_map[cpu] * pcpu_unit_pages + page_idx;
|
2009-02-20 15:29:08 +08:00
|
|
|
}
|
|
|
|
|
2017-07-25 07:02:05 +08:00
|
|
|
static unsigned long pcpu_unit_page_offset(unsigned int cpu, int page_idx)
|
|
|
|
{
|
|
|
|
return pcpu_unit_offsets[cpu] + (page_idx << PAGE_SHIFT);
|
|
|
|
}
|
|
|
|
|
2010-06-18 17:44:31 +08:00
|
|
|
static unsigned long pcpu_chunk_addr(struct pcpu_chunk *chunk,
|
|
|
|
unsigned int cpu, int page_idx)
|
2009-02-20 15:29:08 +08:00
|
|
|
{
|
2017-07-25 07:02:05 +08:00
|
|
|
return (unsigned long)chunk->base_addr +
|
|
|
|
pcpu_unit_page_offset(cpu, page_idx);
|
2009-02-20 15:29:08 +08:00
|
|
|
}
|
|
|
|
|
2017-07-25 07:02:12 +08:00
|
|
|
/*
|
|
|
|
* The following are helper functions to help access bitmaps and convert
|
|
|
|
* between bitmap offsets to address offsets.
|
|
|
|
*/
|
|
|
|
static unsigned long *pcpu_index_alloc_map(struct pcpu_chunk *chunk, int index)
|
|
|
|
{
|
|
|
|
return chunk->alloc_map +
|
|
|
|
(index * PCPU_BITMAP_BLOCK_BITS / BITS_PER_LONG);
|
|
|
|
}
|
|
|
|
|
|
|
|
static unsigned long pcpu_off_to_block_index(int off)
|
|
|
|
{
|
|
|
|
return off / PCPU_BITMAP_BLOCK_BITS;
|
|
|
|
}
|
|
|
|
|
|
|
|
static unsigned long pcpu_off_to_block_off(int off)
|
|
|
|
{
|
|
|
|
return off & (PCPU_BITMAP_BLOCK_BITS - 1);
|
|
|
|
}
|
|
|
|
|
2017-07-25 07:02:17 +08:00
|
|
|
static unsigned long pcpu_block_off_to_off(int index, int off)
|
|
|
|
{
|
|
|
|
return index * PCPU_BITMAP_BLOCK_BITS + off;
|
|
|
|
}
|
|
|
|
|
2021-04-08 11:57:35 +08:00
|
|
|
/**
|
|
|
|
* pcpu_check_block_hint - check against the contig hint
|
|
|
|
* @block: block of interest
|
|
|
|
* @bits: size of allocation
|
|
|
|
* @align: alignment of area (max PAGE_SIZE)
|
|
|
|
*
|
|
|
|
* Check to see if the allocation can fit in the block's contig hint.
|
|
|
|
* Note, a chunk uses the same hints as a block so this can also check against
|
|
|
|
* the chunk's contig hint.
|
|
|
|
*/
|
|
|
|
static bool pcpu_check_block_hint(struct pcpu_block_md *block, int bits,
|
|
|
|
size_t align)
|
|
|
|
{
|
|
|
|
int bit_off = ALIGN(block->contig_hint_start, align) -
|
|
|
|
block->contig_hint_start;
|
|
|
|
|
|
|
|
return bit_off + bits <= block->contig_hint;
|
|
|
|
}
|
|
|
|
|
2019-02-26 05:41:45 +08:00
|
|
|
/*
|
|
|
|
* pcpu_next_hint - determine which hint to use
|
|
|
|
* @block: block of interest
|
|
|
|
* @alloc_bits: size of allocation
|
|
|
|
*
|
|
|
|
* This determines if we should scan based on the scan_hint or first_free.
|
|
|
|
* In general, we want to scan from first_free to fulfill allocations by
|
|
|
|
* first fit. However, if we know a scan_hint at position scan_hint_start
|
|
|
|
* cannot fulfill an allocation, we can begin scanning from there knowing
|
|
|
|
* the contig_hint will be our fallback.
|
|
|
|
*/
|
|
|
|
static int pcpu_next_hint(struct pcpu_block_md *block, int alloc_bits)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* The three conditions below determine if we can skip past the
|
|
|
|
* scan_hint. First, does the scan hint exist. Second, is the
|
|
|
|
* contig_hint after the scan_hint (possibly not true iff
|
|
|
|
* contig_hint == scan_hint). Third, is the allocation request
|
|
|
|
* larger than the scan_hint.
|
|
|
|
*/
|
|
|
|
if (block->scan_hint &&
|
|
|
|
block->contig_hint_start > block->scan_hint_start &&
|
|
|
|
alloc_bits > block->scan_hint)
|
|
|
|
return block->scan_hint_start + block->scan_hint;
|
|
|
|
|
|
|
|
return block->first_free;
|
|
|
|
}
|
|
|
|
|
2017-07-25 07:02:18 +08:00
|
|
|
/**
|
|
|
|
* pcpu_next_md_free_region - finds the next hint free area
|
|
|
|
* @chunk: chunk of interest
|
|
|
|
* @bit_off: chunk offset
|
|
|
|
* @bits: size of free area
|
|
|
|
*
|
|
|
|
* Helper function for pcpu_for_each_md_free_region. It checks
|
|
|
|
* block->contig_hint and performs aggregation across blocks to find the
|
|
|
|
* next hint. It modifies bit_off and bits in-place to be consumed in the
|
|
|
|
* loop.
|
|
|
|
*/
|
|
|
|
static void pcpu_next_md_free_region(struct pcpu_chunk *chunk, int *bit_off,
|
|
|
|
int *bits)
|
|
|
|
{
|
|
|
|
int i = pcpu_off_to_block_index(*bit_off);
|
|
|
|
int block_off = pcpu_off_to_block_off(*bit_off);
|
|
|
|
struct pcpu_block_md *block;
|
|
|
|
|
|
|
|
*bits = 0;
|
|
|
|
for (block = chunk->md_blocks + i; i < pcpu_chunk_nr_blocks(chunk);
|
|
|
|
block++, i++) {
|
|
|
|
/* handles contig area across blocks */
|
|
|
|
if (*bits) {
|
|
|
|
*bits += block->left_free;
|
|
|
|
if (block->left_free == PCPU_BITMAP_BLOCK_BITS)
|
|
|
|
continue;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This checks three things. First is there a contig_hint to
|
|
|
|
* check. Second, have we checked this hint before by
|
|
|
|
* comparing the block_off. Third, is this the same as the
|
|
|
|
* right contig hint. In the last case, it spills over into
|
|
|
|
* the next block and should be handled by the contig area
|
|
|
|
* across blocks code.
|
|
|
|
*/
|
|
|
|
*bits = block->contig_hint;
|
|
|
|
if (*bits && block->contig_hint_start >= block_off &&
|
|
|
|
*bits + block->contig_hint_start < PCPU_BITMAP_BLOCK_BITS) {
|
|
|
|
*bit_off = pcpu_block_off_to_off(i,
|
|
|
|
block->contig_hint_start);
|
|
|
|
return;
|
|
|
|
}
|
2017-09-28 05:35:00 +08:00
|
|
|
/* reset to satisfy the second predicate above */
|
|
|
|
block_off = 0;
|
2017-07-25 07:02:18 +08:00
|
|
|
|
|
|
|
*bits = block->right_free;
|
|
|
|
*bit_off = (i + 1) * PCPU_BITMAP_BLOCK_BITS - block->right_free;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-07-25 07:02:19 +08:00
|
|
|
/**
|
|
|
|
* pcpu_next_fit_region - finds fit areas for a given allocation request
|
|
|
|
* @chunk: chunk of interest
|
|
|
|
* @alloc_bits: size of allocation
|
|
|
|
* @align: alignment of area (max PAGE_SIZE)
|
|
|
|
* @bit_off: chunk offset
|
|
|
|
* @bits: size of free area
|
|
|
|
*
|
|
|
|
* Finds the next free region that is viable for use with a given size and
|
|
|
|
* alignment. This only returns if there is a valid area to be used for this
|
|
|
|
* allocation. block->first_free is returned if the allocation request fits
|
|
|
|
* within the block to see if the request can be fulfilled prior to the contig
|
|
|
|
* hint.
|
|
|
|
*/
|
|
|
|
static void pcpu_next_fit_region(struct pcpu_chunk *chunk, int alloc_bits,
|
|
|
|
int align, int *bit_off, int *bits)
|
|
|
|
{
|
|
|
|
int i = pcpu_off_to_block_index(*bit_off);
|
|
|
|
int block_off = pcpu_off_to_block_off(*bit_off);
|
|
|
|
struct pcpu_block_md *block;
|
|
|
|
|
|
|
|
*bits = 0;
|
|
|
|
for (block = chunk->md_blocks + i; i < pcpu_chunk_nr_blocks(chunk);
|
|
|
|
block++, i++) {
|
|
|
|
/* handles contig area across blocks */
|
|
|
|
if (*bits) {
|
|
|
|
*bits += block->left_free;
|
|
|
|
if (*bits >= alloc_bits)
|
|
|
|
return;
|
|
|
|
if (block->left_free == PCPU_BITMAP_BLOCK_BITS)
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* check block->contig_hint */
|
|
|
|
*bits = ALIGN(block->contig_hint_start, align) -
|
|
|
|
block->contig_hint_start;
|
|
|
|
/*
|
|
|
|
* This uses the block offset to determine if this has been
|
|
|
|
* checked in the prior iteration.
|
|
|
|
*/
|
|
|
|
if (block->contig_hint &&
|
|
|
|
block->contig_hint_start >= block_off &&
|
|
|
|
block->contig_hint >= *bits + alloc_bits) {
|
2019-02-26 05:41:45 +08:00
|
|
|
int start = pcpu_next_hint(block, alloc_bits);
|
|
|
|
|
2017-07-25 07:02:19 +08:00
|
|
|
*bits += alloc_bits + block->contig_hint_start -
|
2019-02-26 05:41:45 +08:00
|
|
|
start;
|
|
|
|
*bit_off = pcpu_block_off_to_off(i, start);
|
2017-07-25 07:02:19 +08:00
|
|
|
return;
|
|
|
|
}
|
2017-09-28 05:35:00 +08:00
|
|
|
/* reset to satisfy the second predicate above */
|
|
|
|
block_off = 0;
|
2017-07-25 07:02:19 +08:00
|
|
|
|
|
|
|
*bit_off = ALIGN(PCPU_BITMAP_BLOCK_BITS - block->right_free,
|
|
|
|
align);
|
|
|
|
*bits = PCPU_BITMAP_BLOCK_BITS - *bit_off;
|
|
|
|
*bit_off = pcpu_block_off_to_off(i, *bit_off);
|
|
|
|
if (*bits >= alloc_bits)
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* no valid offsets were found - fail condition */
|
|
|
|
*bit_off = pcpu_chunk_map_bits(chunk);
|
|
|
|
}
|
|
|
|
|
2017-07-25 07:02:18 +08:00
|
|
|
/*
|
|
|
|
* Metadata free area iterators. These perform aggregation of free areas
|
|
|
|
* based on the metadata blocks and return the offset @bit_off and size in
|
2017-07-25 07:02:19 +08:00
|
|
|
* bits of the free area @bits. pcpu_for_each_fit_region only returns when
|
|
|
|
* a fit is found for the allocation request.
|
2017-07-25 07:02:18 +08:00
|
|
|
*/
|
|
|
|
#define pcpu_for_each_md_free_region(chunk, bit_off, bits) \
|
|
|
|
for (pcpu_next_md_free_region((chunk), &(bit_off), &(bits)); \
|
|
|
|
(bit_off) < pcpu_chunk_map_bits((chunk)); \
|
|
|
|
(bit_off) += (bits) + 1, \
|
|
|
|
pcpu_next_md_free_region((chunk), &(bit_off), &(bits)))
|
|
|
|
|
2017-07-25 07:02:19 +08:00
|
|
|
#define pcpu_for_each_fit_region(chunk, alloc_bits, align, bit_off, bits) \
|
|
|
|
for (pcpu_next_fit_region((chunk), (alloc_bits), (align), &(bit_off), \
|
|
|
|
&(bits)); \
|
|
|
|
(bit_off) < pcpu_chunk_map_bits((chunk)); \
|
|
|
|
(bit_off) += (bits), \
|
|
|
|
pcpu_next_fit_region((chunk), (alloc_bits), (align), &(bit_off), \
|
|
|
|
&(bits)))
|
|
|
|
|
2009-02-20 15:29:08 +08:00
|
|
|
/**
|
2011-08-04 17:02:33 +08:00
|
|
|
* pcpu_mem_zalloc - allocate memory
|
2009-03-06 23:44:09 +08:00
|
|
|
* @size: bytes to allocate
|
2018-02-17 02:07:19 +08:00
|
|
|
* @gfp: allocation flags
|
2009-02-20 15:29:08 +08:00
|
|
|
*
|
2009-03-06 23:44:09 +08:00
|
|
|
* Allocate @size bytes. If @size is smaller than PAGE_SIZE,
|
2018-02-17 02:07:19 +08:00
|
|
|
* kzalloc() is used; otherwise, the equivalent of vzalloc() is used.
|
|
|
|
* This is to facilitate passing through whitelisted flags. The
|
|
|
|
* returned memory is always zeroed.
|
2009-02-20 15:29:08 +08:00
|
|
|
*
|
|
|
|
* RETURNS:
|
2009-03-06 23:44:09 +08:00
|
|
|
* Pointer to the allocated area on success, NULL on failure.
|
2009-02-20 15:29:08 +08:00
|
|
|
*/
|
2018-02-17 02:07:19 +08:00
|
|
|
static void *pcpu_mem_zalloc(size_t size, gfp_t gfp)
|
2009-02-20 15:29:08 +08:00
|
|
|
{
|
2010-06-28 00:50:00 +08:00
|
|
|
if (WARN_ON_ONCE(!slab_is_available()))
|
|
|
|
return NULL;
|
|
|
|
|
2009-03-06 23:44:09 +08:00
|
|
|
if (size <= PAGE_SIZE)
|
2018-02-17 02:09:58 +08:00
|
|
|
return kzalloc(size, gfp);
|
2010-10-30 21:56:54 +08:00
|
|
|
else
|
2020-06-02 12:51:40 +08:00
|
|
|
return __vmalloc(size, gfp | __GFP_ZERO);
|
2009-03-06 23:44:09 +08:00
|
|
|
}
|
2009-02-20 15:29:08 +08:00
|
|
|
|
2009-03-06 23:44:09 +08:00
|
|
|
/**
|
|
|
|
* pcpu_mem_free - free memory
|
|
|
|
* @ptr: memory to free
|
|
|
|
*
|
2011-08-04 17:02:33 +08:00
|
|
|
* Free @ptr. @ptr should have been allocated using pcpu_mem_zalloc().
|
2009-03-06 23:44:09 +08:00
|
|
|
*/
|
2016-01-23 07:11:02 +08:00
|
|
|
static void pcpu_mem_free(void *ptr)
|
2009-03-06 23:44:09 +08:00
|
|
|
{
|
2016-01-23 07:11:02 +08:00
|
|
|
kvfree(ptr);
|
2009-02-20 15:29:08 +08:00
|
|
|
}
|
|
|
|
|
2019-02-26 01:03:50 +08:00
|
|
|
static void __pcpu_chunk_move(struct pcpu_chunk *chunk, int slot,
|
|
|
|
bool move_front)
|
|
|
|
{
|
|
|
|
if (chunk != pcpu_reserved_chunk) {
|
|
|
|
if (move_front)
|
2021-06-03 09:09:31 +08:00
|
|
|
list_move(&chunk->list, &pcpu_chunk_lists[slot]);
|
2019-02-26 01:03:50 +08:00
|
|
|
else
|
2021-06-03 09:09:31 +08:00
|
|
|
list_move_tail(&chunk->list, &pcpu_chunk_lists[slot]);
|
2019-02-26 01:03:50 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void pcpu_chunk_move(struct pcpu_chunk *chunk, int slot)
|
|
|
|
{
|
|
|
|
__pcpu_chunk_move(chunk, slot, true);
|
|
|
|
}
|
|
|
|
|
2009-02-20 15:29:08 +08:00
|
|
|
/**
|
|
|
|
* pcpu_chunk_relocate - put chunk in the appropriate chunk slot
|
|
|
|
* @chunk: chunk of interest
|
|
|
|
* @oslot: the previous slot it was on
|
|
|
|
*
|
|
|
|
* This function is called after an allocation or free changed @chunk.
|
|
|
|
* New slot according to the changed state is determined and @chunk is
|
2009-03-06 13:33:59 +08:00
|
|
|
* moved to the slot. Note that the reserved chunk is never put on
|
|
|
|
* chunk slots.
|
2009-03-06 23:44:13 +08:00
|
|
|
*
|
|
|
|
* CONTEXT:
|
|
|
|
* pcpu_lock.
|
2009-02-20 15:29:08 +08:00
|
|
|
*/
|
|
|
|
static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot)
|
|
|
|
{
|
|
|
|
int nslot = pcpu_chunk_slot(chunk);
|
|
|
|
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
/* leave isolated chunks in-place */
|
|
|
|
if (chunk->isolated)
|
|
|
|
return;
|
|
|
|
|
2019-02-26 01:03:50 +08:00
|
|
|
if (oslot != nslot)
|
|
|
|
__pcpu_chunk_move(chunk, nslot, oslot < nslot);
|
2009-11-11 14:35:18 +08:00
|
|
|
}
|
|
|
|
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
static void pcpu_isolate_chunk(struct pcpu_chunk *chunk)
|
|
|
|
{
|
|
|
|
lockdep_assert_held(&pcpu_lock);
|
|
|
|
|
|
|
|
if (!chunk->isolated) {
|
|
|
|
chunk->isolated = true;
|
2021-06-03 09:09:31 +08:00
|
|
|
pcpu_nr_empty_pop_pages -= chunk->nr_empty_pop_pages;
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
}
|
2021-06-03 09:09:31 +08:00
|
|
|
list_move(&chunk->list, &pcpu_chunk_lists[pcpu_to_depopulate_slot]);
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void pcpu_reintegrate_chunk(struct pcpu_chunk *chunk)
|
|
|
|
{
|
|
|
|
lockdep_assert_held(&pcpu_lock);
|
|
|
|
|
|
|
|
if (chunk->isolated) {
|
|
|
|
chunk->isolated = false;
|
2021-06-03 09:09:31 +08:00
|
|
|
pcpu_nr_empty_pop_pages += chunk->nr_empty_pop_pages;
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
pcpu_chunk_relocate(chunk, -1);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-02-14 03:10:30 +08:00
|
|
|
/*
|
|
|
|
* pcpu_update_empty_pages - update empty page counters
|
2009-11-11 14:35:18 +08:00
|
|
|
* @chunk: chunk of interest
|
2019-02-14 03:10:30 +08:00
|
|
|
* @nr: nr of empty pages
|
2009-11-11 14:35:18 +08:00
|
|
|
*
|
2019-02-14 03:10:30 +08:00
|
|
|
* This is used to keep track of the empty pages now based on the premise
|
|
|
|
* a md_block covers a page. The hint update functions recognize if a block
|
|
|
|
* is made full or broken to calculate deltas for keeping track of free pages.
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
*/
|
2019-02-14 03:10:30 +08:00
|
|
|
static inline void pcpu_update_empty_pages(struct pcpu_chunk *chunk, int nr)
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
{
|
2019-02-14 03:10:30 +08:00
|
|
|
chunk->nr_empty_pop_pages += nr;
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
if (chunk != pcpu_reserved_chunk && !chunk->isolated)
|
2021-06-03 09:09:31 +08:00
|
|
|
pcpu_nr_empty_pop_pages += nr;
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
}
|
|
|
|
|
2019-02-22 07:44:35 +08:00
|
|
|
/*
|
|
|
|
* pcpu_region_overlap - determines if two regions overlap
|
|
|
|
* @a: start of first region, inclusive
|
|
|
|
* @b: end of first region, exclusive
|
|
|
|
* @x: start of second region, inclusive
|
|
|
|
* @y: end of second region, exclusive
|
2009-11-11 14:35:18 +08:00
|
|
|
*
|
2019-02-22 07:44:35 +08:00
|
|
|
* This is used to determine if the hint region [a, b) overlaps with the
|
|
|
|
* allocated region [x, y).
|
2009-11-11 14:35:18 +08:00
|
|
|
*/
|
2019-02-22 07:44:35 +08:00
|
|
|
static inline bool pcpu_region_overlap(int a, int b, int x, int y)
|
2009-11-11 14:35:18 +08:00
|
|
|
{
|
2019-02-22 07:44:35 +08:00
|
|
|
return (a < y) && (x < b);
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
}
|
2009-03-06 23:44:09 +08:00
|
|
|
|
2017-07-25 07:02:12 +08:00
|
|
|
/**
|
|
|
|
* pcpu_block_update - updates a block given a free area
|
|
|
|
* @block: block of interest
|
|
|
|
* @start: start offset in block
|
|
|
|
* @end: end offset in block
|
|
|
|
*
|
|
|
|
* Updates a block given a known free area. The region [start, end) is
|
2017-07-25 07:02:15 +08:00
|
|
|
* expected to be the entirety of the free area within a block. Chooses
|
|
|
|
* the best starting offset if the contig hints are equal.
|
2017-07-25 07:02:12 +08:00
|
|
|
*/
|
|
|
|
static void pcpu_block_update(struct pcpu_block_md *block, int start, int end)
|
|
|
|
{
|
|
|
|
int contig = end - start;
|
|
|
|
|
|
|
|
block->first_free = min(block->first_free, start);
|
|
|
|
if (start == 0)
|
|
|
|
block->left_free = contig;
|
|
|
|
|
2019-02-27 01:56:16 +08:00
|
|
|
if (end == block->nr_bits)
|
2017-07-25 07:02:12 +08:00
|
|
|
block->right_free = contig;
|
|
|
|
|
|
|
|
if (contig > block->contig_hint) {
|
2019-02-26 05:41:45 +08:00
|
|
|
/* promote the old contig_hint to be the new scan_hint */
|
|
|
|
if (start > block->contig_hint_start) {
|
|
|
|
if (block->contig_hint > block->scan_hint) {
|
|
|
|
block->scan_hint_start =
|
|
|
|
block->contig_hint_start;
|
|
|
|
block->scan_hint = block->contig_hint;
|
|
|
|
} else if (start < block->scan_hint_start) {
|
|
|
|
/*
|
|
|
|
* The old contig_hint == scan_hint. But, the
|
|
|
|
* new contig is larger so hold the invariant
|
|
|
|
* scan_hint_start < contig_hint_start.
|
|
|
|
*/
|
|
|
|
block->scan_hint = 0;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
block->scan_hint = 0;
|
|
|
|
}
|
2017-07-25 07:02:12 +08:00
|
|
|
block->contig_hint_start = start;
|
|
|
|
block->contig_hint = contig;
|
2019-02-26 05:41:45 +08:00
|
|
|
} else if (contig == block->contig_hint) {
|
|
|
|
if (block->contig_hint_start &&
|
|
|
|
(!start ||
|
|
|
|
__ffs(start) > __ffs(block->contig_hint_start))) {
|
|
|
|
/* start has a better alignment so use it */
|
|
|
|
block->contig_hint_start = start;
|
|
|
|
if (start < block->scan_hint_start &&
|
|
|
|
block->contig_hint > block->scan_hint)
|
|
|
|
block->scan_hint = 0;
|
|
|
|
} else if (start > block->scan_hint_start ||
|
|
|
|
block->contig_hint > block->scan_hint) {
|
|
|
|
/*
|
|
|
|
* Knowing contig == contig_hint, update the scan_hint
|
|
|
|
* if it is farther than or larger than the current
|
|
|
|
* scan_hint.
|
|
|
|
*/
|
|
|
|
block->scan_hint_start = start;
|
|
|
|
block->scan_hint = contig;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* The region is smaller than the contig_hint. So only update
|
|
|
|
* the scan_hint if it is larger than or equal and farther than
|
|
|
|
* the current scan_hint.
|
|
|
|
*/
|
|
|
|
if ((start < block->contig_hint_start &&
|
|
|
|
(contig > block->scan_hint ||
|
|
|
|
(contig == block->scan_hint &&
|
|
|
|
start > block->scan_hint_start)))) {
|
|
|
|
block->scan_hint_start = start;
|
|
|
|
block->scan_hint = contig;
|
|
|
|
}
|
2017-07-25 07:02:12 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-02-23 01:03:16 +08:00
|
|
|
/*
|
|
|
|
* pcpu_block_update_scan - update a block given a free area from a scan
|
|
|
|
* @chunk: chunk of interest
|
|
|
|
* @bit_off: chunk offset
|
|
|
|
* @bits: size of free area
|
|
|
|
*
|
|
|
|
* Finding the final allocation spot first goes through pcpu_find_block_fit()
|
|
|
|
* to find a block that can hold the allocation and then pcpu_alloc_area()
|
|
|
|
* where a scan is used. When allocations require specific alignments,
|
|
|
|
* we can inadvertently create holes which will not be seen in the alloc
|
|
|
|
* or free paths.
|
|
|
|
*
|
|
|
|
* This takes a given free area hole and updates a block as it may change the
|
|
|
|
* scan_hint. We need to scan backwards to ensure we don't miss free bits
|
|
|
|
* from alignment.
|
|
|
|
*/
|
|
|
|
static void pcpu_block_update_scan(struct pcpu_chunk *chunk, int bit_off,
|
|
|
|
int bits)
|
|
|
|
{
|
|
|
|
int s_off = pcpu_off_to_block_off(bit_off);
|
|
|
|
int e_off = s_off + bits;
|
|
|
|
int s_index, l_bit;
|
|
|
|
struct pcpu_block_md *block;
|
|
|
|
|
|
|
|
if (e_off > PCPU_BITMAP_BLOCK_BITS)
|
|
|
|
return;
|
|
|
|
|
|
|
|
s_index = pcpu_off_to_block_index(bit_off);
|
|
|
|
block = chunk->md_blocks + s_index;
|
|
|
|
|
|
|
|
/* scan backwards in case of alignment skipping free bits */
|
|
|
|
l_bit = find_last_bit(pcpu_index_alloc_map(chunk, s_index), s_off);
|
|
|
|
s_off = (s_off == l_bit) ? 0 : l_bit + 1;
|
|
|
|
|
|
|
|
pcpu_block_update(block, s_off, e_off);
|
|
|
|
}
|
|
|
|
|
2019-02-27 02:00:08 +08:00
|
|
|
/**
|
|
|
|
* pcpu_chunk_refresh_hint - updates metadata about a chunk
|
|
|
|
* @chunk: chunk of interest
|
2019-02-27 02:46:48 +08:00
|
|
|
* @full_scan: if we should scan from the beginning
|
2019-02-27 02:00:08 +08:00
|
|
|
*
|
|
|
|
* Iterates over the metadata blocks to find the largest contig area.
|
2019-02-27 02:46:48 +08:00
|
|
|
* A full scan can be avoided on the allocation path as this is triggered
|
|
|
|
* if we broke the contig_hint. In doing so, the scan_hint will be before
|
|
|
|
* the contig_hint or after if the scan_hint == contig_hint. This cannot
|
|
|
|
* be prevented on freeing as we want to find the largest area possibly
|
|
|
|
* spanning blocks.
|
2019-02-27 02:00:08 +08:00
|
|
|
*/
|
2019-02-27 02:46:48 +08:00
|
|
|
static void pcpu_chunk_refresh_hint(struct pcpu_chunk *chunk, bool full_scan)
|
2019-02-27 02:00:08 +08:00
|
|
|
{
|
|
|
|
struct pcpu_block_md *chunk_md = &chunk->chunk_md;
|
|
|
|
int bit_off, bits;
|
|
|
|
|
2019-02-27 02:46:48 +08:00
|
|
|
/* promote scan_hint to contig_hint */
|
|
|
|
if (!full_scan && chunk_md->scan_hint) {
|
|
|
|
bit_off = chunk_md->scan_hint_start + chunk_md->scan_hint;
|
|
|
|
chunk_md->contig_hint_start = chunk_md->scan_hint_start;
|
|
|
|
chunk_md->contig_hint = chunk_md->scan_hint;
|
|
|
|
chunk_md->scan_hint = 0;
|
|
|
|
} else {
|
|
|
|
bit_off = chunk_md->first_free;
|
|
|
|
chunk_md->contig_hint = 0;
|
|
|
|
}
|
2019-02-27 02:00:08 +08:00
|
|
|
|
|
|
|
bits = 0;
|
2019-12-14 08:22:10 +08:00
|
|
|
pcpu_for_each_md_free_region(chunk, bit_off, bits)
|
2019-02-27 02:00:08 +08:00
|
|
|
pcpu_block_update(chunk_md, bit_off, bit_off + bits);
|
2017-07-25 07:02:12 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* pcpu_block_refresh_hint
|
|
|
|
* @chunk: chunk of interest
|
|
|
|
* @index: index of the metadata block
|
|
|
|
*
|
|
|
|
* Scans over the block beginning at first_free and updates the block
|
|
|
|
* metadata accordingly.
|
|
|
|
*/
|
|
|
|
static void pcpu_block_refresh_hint(struct pcpu_chunk *chunk, int index)
|
|
|
|
{
|
|
|
|
struct pcpu_block_md *block = chunk->md_blocks + index;
|
|
|
|
unsigned long *alloc_map = pcpu_index_alloc_map(chunk, index);
|
2019-12-14 08:22:10 +08:00
|
|
|
unsigned int rs, re, start; /* region start, region end */
|
2019-02-26 06:10:15 +08:00
|
|
|
|
|
|
|
/* promote scan_hint to contig_hint */
|
|
|
|
if (block->scan_hint) {
|
|
|
|
start = block->scan_hint_start + block->scan_hint;
|
|
|
|
block->contig_hint_start = block->scan_hint_start;
|
|
|
|
block->contig_hint = block->scan_hint;
|
|
|
|
block->scan_hint = 0;
|
|
|
|
} else {
|
|
|
|
start = block->first_free;
|
|
|
|
block->contig_hint = 0;
|
|
|
|
}
|
2017-07-25 07:02:12 +08:00
|
|
|
|
2019-02-26 06:10:15 +08:00
|
|
|
block->right_free = 0;
|
2017-07-25 07:02:12 +08:00
|
|
|
|
|
|
|
/* iterate over free areas and update the contig hints */
|
2019-12-14 08:22:10 +08:00
|
|
|
bitmap_for_each_clear_region(alloc_map, rs, re, start,
|
|
|
|
PCPU_BITMAP_BLOCK_BITS)
|
2017-07-25 07:02:12 +08:00
|
|
|
pcpu_block_update(block, rs, re);
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* pcpu_block_update_hint_alloc - update hint on allocation path
|
|
|
|
* @chunk: chunk of interest
|
|
|
|
* @bit_off: chunk offset
|
|
|
|
* @bits: size of request
|
2017-07-25 07:02:16 +08:00
|
|
|
*
|
|
|
|
* Updates metadata for the allocation path. The metadata only has to be
|
|
|
|
* refreshed by a full scan iff the chunk's contig hint is broken. Block level
|
|
|
|
* scans are required if the block's contig hint is broken.
|
2017-07-25 07:02:12 +08:00
|
|
|
*/
|
|
|
|
static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
|
|
|
|
int bits)
|
|
|
|
{
|
2019-02-27 02:00:08 +08:00
|
|
|
struct pcpu_block_md *chunk_md = &chunk->chunk_md;
|
2019-02-14 03:10:30 +08:00
|
|
|
int nr_empty_pages = 0;
|
2017-07-25 07:02:12 +08:00
|
|
|
struct pcpu_block_md *s_block, *e_block, *block;
|
|
|
|
int s_index, e_index; /* block indexes of the freed allocation */
|
|
|
|
int s_off, e_off; /* block offsets of the freed allocation */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Calculate per block offsets.
|
|
|
|
* The calculation uses an inclusive range, but the resulting offsets
|
|
|
|
* are [start, end). e_index always points to the last block in the
|
|
|
|
* range.
|
|
|
|
*/
|
|
|
|
s_index = pcpu_off_to_block_index(bit_off);
|
|
|
|
e_index = pcpu_off_to_block_index(bit_off + bits - 1);
|
|
|
|
s_off = pcpu_off_to_block_off(bit_off);
|
|
|
|
e_off = pcpu_off_to_block_off(bit_off + bits - 1) + 1;
|
|
|
|
|
|
|
|
s_block = chunk->md_blocks + s_index;
|
|
|
|
e_block = chunk->md_blocks + e_index;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Update s_block.
|
2017-07-25 07:02:16 +08:00
|
|
|
* block->first_free must be updated if the allocation takes its place.
|
|
|
|
* If the allocation breaks the contig_hint, a scan is required to
|
|
|
|
* restore this hint.
|
2017-07-25 07:02:12 +08:00
|
|
|
*/
|
2019-02-14 03:10:30 +08:00
|
|
|
if (s_block->contig_hint == PCPU_BITMAP_BLOCK_BITS)
|
|
|
|
nr_empty_pages++;
|
|
|
|
|
2017-07-25 07:02:16 +08:00
|
|
|
if (s_off == s_block->first_free)
|
|
|
|
s_block->first_free = find_next_zero_bit(
|
|
|
|
pcpu_index_alloc_map(chunk, s_index),
|
|
|
|
PCPU_BITMAP_BLOCK_BITS,
|
|
|
|
s_off + bits);
|
|
|
|
|
2019-02-26 05:41:45 +08:00
|
|
|
if (pcpu_region_overlap(s_block->scan_hint_start,
|
|
|
|
s_block->scan_hint_start + s_block->scan_hint,
|
|
|
|
s_off,
|
|
|
|
s_off + bits))
|
|
|
|
s_block->scan_hint = 0;
|
|
|
|
|
2019-02-22 07:44:35 +08:00
|
|
|
if (pcpu_region_overlap(s_block->contig_hint_start,
|
|
|
|
s_block->contig_hint_start +
|
|
|
|
s_block->contig_hint,
|
|
|
|
s_off,
|
|
|
|
s_off + bits)) {
|
2017-07-25 07:02:16 +08:00
|
|
|
/* block contig hint is broken - scan to fix it */
|
2019-02-26 06:10:15 +08:00
|
|
|
if (!s_off)
|
|
|
|
s_block->left_free = 0;
|
2017-07-25 07:02:16 +08:00
|
|
|
pcpu_block_refresh_hint(chunk, s_index);
|
|
|
|
} else {
|
|
|
|
/* update left and right contig manually */
|
|
|
|
s_block->left_free = min(s_block->left_free, s_off);
|
|
|
|
if (s_index == e_index)
|
|
|
|
s_block->right_free = min_t(int, s_block->right_free,
|
|
|
|
PCPU_BITMAP_BLOCK_BITS - e_off);
|
|
|
|
else
|
|
|
|
s_block->right_free = 0;
|
|
|
|
}
|
2017-07-25 07:02:12 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Update e_block.
|
|
|
|
*/
|
|
|
|
if (s_index != e_index) {
|
2019-02-14 03:10:30 +08:00
|
|
|
if (e_block->contig_hint == PCPU_BITMAP_BLOCK_BITS)
|
|
|
|
nr_empty_pages++;
|
|
|
|
|
2017-07-25 07:02:16 +08:00
|
|
|
/*
|
|
|
|
* When the allocation is across blocks, the end is along
|
|
|
|
* the left part of the e_block.
|
|
|
|
*/
|
|
|
|
e_block->first_free = find_next_zero_bit(
|
|
|
|
pcpu_index_alloc_map(chunk, e_index),
|
|
|
|
PCPU_BITMAP_BLOCK_BITS, e_off);
|
|
|
|
|
|
|
|
if (e_off == PCPU_BITMAP_BLOCK_BITS) {
|
|
|
|
/* reset the block */
|
|
|
|
e_block++;
|
|
|
|
} else {
|
2019-02-26 05:41:45 +08:00
|
|
|
if (e_off > e_block->scan_hint_start)
|
|
|
|
e_block->scan_hint = 0;
|
|
|
|
|
2019-02-26 06:10:15 +08:00
|
|
|
e_block->left_free = 0;
|
2017-07-25 07:02:16 +08:00
|
|
|
if (e_off > e_block->contig_hint_start) {
|
|
|
|
/* contig hint is broken - scan to fix it */
|
|
|
|
pcpu_block_refresh_hint(chunk, e_index);
|
|
|
|
} else {
|
|
|
|
e_block->right_free =
|
|
|
|
min_t(int, e_block->right_free,
|
|
|
|
PCPU_BITMAP_BLOCK_BITS - e_off);
|
|
|
|
}
|
|
|
|
}
|
2017-07-25 07:02:12 +08:00
|
|
|
|
|
|
|
/* update in-between md_blocks */
|
2019-02-14 03:10:30 +08:00
|
|
|
nr_empty_pages += (e_index - s_index - 1);
|
2017-07-25 07:02:12 +08:00
|
|
|
for (block = s_block + 1; block < e_block; block++) {
|
2019-02-26 05:41:45 +08:00
|
|
|
block->scan_hint = 0;
|
2017-07-25 07:02:12 +08:00
|
|
|
block->contig_hint = 0;
|
|
|
|
block->left_free = 0;
|
|
|
|
block->right_free = 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-02-14 03:10:30 +08:00
|
|
|
if (nr_empty_pages)
|
|
|
|
pcpu_update_empty_pages(chunk, -nr_empty_pages);
|
|
|
|
|
2019-02-27 02:46:48 +08:00
|
|
|
if (pcpu_region_overlap(chunk_md->scan_hint_start,
|
|
|
|
chunk_md->scan_hint_start +
|
|
|
|
chunk_md->scan_hint,
|
|
|
|
bit_off,
|
|
|
|
bit_off + bits))
|
|
|
|
chunk_md->scan_hint = 0;
|
|
|
|
|
2017-07-25 07:02:16 +08:00
|
|
|
/*
|
|
|
|
* The only time a full chunk scan is required is if the chunk
|
|
|
|
* contig hint is broken. Otherwise, it means a smaller space
|
|
|
|
* was used and therefore the chunk contig hint is still correct.
|
|
|
|
*/
|
2019-02-27 02:00:08 +08:00
|
|
|
if (pcpu_region_overlap(chunk_md->contig_hint_start,
|
|
|
|
chunk_md->contig_hint_start +
|
|
|
|
chunk_md->contig_hint,
|
2019-02-22 07:44:35 +08:00
|
|
|
bit_off,
|
|
|
|
bit_off + bits))
|
2019-02-27 02:46:48 +08:00
|
|
|
pcpu_chunk_refresh_hint(chunk, false);
|
2017-07-25 07:02:12 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* pcpu_block_update_hint_free - updates the block hints on the free path
|
|
|
|
* @chunk: chunk of interest
|
|
|
|
* @bit_off: chunk offset
|
|
|
|
* @bits: size of request
|
2017-07-25 07:02:17 +08:00
|
|
|
*
|
|
|
|
* Updates metadata for the allocation path. This avoids a blind block
|
|
|
|
* refresh by making use of the block contig hints. If this fails, it scans
|
|
|
|
* forward and backward to determine the extent of the free area. This is
|
|
|
|
* capped at the boundary of blocks.
|
|
|
|
*
|
|
|
|
* A chunk update is triggered if a page becomes free, a block becomes free,
|
|
|
|
* or the free spans across blocks. This tradeoff is to minimize iterating
|
2019-02-27 02:00:08 +08:00
|
|
|
* over the block metadata to update chunk_md->contig_hint.
|
|
|
|
* chunk_md->contig_hint may be off by up to a page, but it will never be more
|
|
|
|
* than the available space. If the contig hint is contained in one block, it
|
|
|
|
* will be accurate.
|
2017-07-25 07:02:12 +08:00
|
|
|
*/
|
|
|
|
static void pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int bit_off,
|
|
|
|
int bits)
|
|
|
|
{
|
2019-02-14 03:10:30 +08:00
|
|
|
int nr_empty_pages = 0;
|
2017-07-25 07:02:12 +08:00
|
|
|
struct pcpu_block_md *s_block, *e_block, *block;
|
|
|
|
int s_index, e_index; /* block indexes of the freed allocation */
|
|
|
|
int s_off, e_off; /* block offsets of the freed allocation */
|
2017-07-25 07:02:17 +08:00
|
|
|
int start, end; /* start and end of the whole free area */
|
2017-07-25 07:02:12 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Calculate per block offsets.
|
|
|
|
* The calculation uses an inclusive range, but the resulting offsets
|
|
|
|
* are [start, end). e_index always points to the last block in the
|
|
|
|
* range.
|
|
|
|
*/
|
|
|
|
s_index = pcpu_off_to_block_index(bit_off);
|
|
|
|
e_index = pcpu_off_to_block_index(bit_off + bits - 1);
|
|
|
|
s_off = pcpu_off_to_block_off(bit_off);
|
|
|
|
e_off = pcpu_off_to_block_off(bit_off + bits - 1) + 1;
|
|
|
|
|
|
|
|
s_block = chunk->md_blocks + s_index;
|
|
|
|
e_block = chunk->md_blocks + e_index;
|
|
|
|
|
2017-07-25 07:02:17 +08:00
|
|
|
/*
|
|
|
|
* Check if the freed area aligns with the block->contig_hint.
|
|
|
|
* If it does, then the scan to find the beginning/end of the
|
|
|
|
* larger free area can be avoided.
|
|
|
|
*
|
|
|
|
* start and end refer to beginning and end of the free area
|
|
|
|
* within each their respective blocks. This is not necessarily
|
|
|
|
* the entire free area as it may span blocks past the beginning
|
|
|
|
* or end of the block.
|
|
|
|
*/
|
|
|
|
start = s_off;
|
|
|
|
if (s_off == s_block->contig_hint + s_block->contig_hint_start) {
|
|
|
|
start = s_block->contig_hint_start;
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Scan backwards to find the extent of the free area.
|
|
|
|
* find_last_bit returns the starting bit, so if the start bit
|
|
|
|
* is returned, that means there was no last bit and the
|
|
|
|
* remainder of the chunk is free.
|
|
|
|
*/
|
|
|
|
int l_bit = find_last_bit(pcpu_index_alloc_map(chunk, s_index),
|
|
|
|
start);
|
|
|
|
start = (start == l_bit) ? 0 : l_bit + 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
end = e_off;
|
|
|
|
if (e_off == e_block->contig_hint_start)
|
|
|
|
end = e_block->contig_hint_start + e_block->contig_hint;
|
|
|
|
else
|
|
|
|
end = find_next_bit(pcpu_index_alloc_map(chunk, e_index),
|
|
|
|
PCPU_BITMAP_BLOCK_BITS, end);
|
|
|
|
|
2017-07-25 07:02:12 +08:00
|
|
|
/* update s_block */
|
2017-07-25 07:02:17 +08:00
|
|
|
e_off = (s_index == e_index) ? end : PCPU_BITMAP_BLOCK_BITS;
|
2019-02-14 03:10:30 +08:00
|
|
|
if (!start && e_off == PCPU_BITMAP_BLOCK_BITS)
|
|
|
|
nr_empty_pages++;
|
2017-07-25 07:02:17 +08:00
|
|
|
pcpu_block_update(s_block, start, e_off);
|
2017-07-25 07:02:12 +08:00
|
|
|
|
|
|
|
/* freeing in the same block */
|
|
|
|
if (s_index != e_index) {
|
|
|
|
/* update e_block */
|
2019-02-14 03:10:30 +08:00
|
|
|
if (end == PCPU_BITMAP_BLOCK_BITS)
|
|
|
|
nr_empty_pages++;
|
2017-07-25 07:02:17 +08:00
|
|
|
pcpu_block_update(e_block, 0, end);
|
2017-07-25 07:02:12 +08:00
|
|
|
|
|
|
|
/* reset md_blocks in the middle */
|
2019-02-14 03:10:30 +08:00
|
|
|
nr_empty_pages += (e_index - s_index - 1);
|
2017-07-25 07:02:12 +08:00
|
|
|
for (block = s_block + 1; block < e_block; block++) {
|
|
|
|
block->first_free = 0;
|
2019-02-26 05:41:45 +08:00
|
|
|
block->scan_hint = 0;
|
2017-07-25 07:02:12 +08:00
|
|
|
block->contig_hint_start = 0;
|
|
|
|
block->contig_hint = PCPU_BITMAP_BLOCK_BITS;
|
|
|
|
block->left_free = PCPU_BITMAP_BLOCK_BITS;
|
|
|
|
block->right_free = PCPU_BITMAP_BLOCK_BITS;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-02-14 03:10:30 +08:00
|
|
|
if (nr_empty_pages)
|
|
|
|
pcpu_update_empty_pages(chunk, nr_empty_pages);
|
|
|
|
|
2017-07-25 07:02:17 +08:00
|
|
|
/*
|
2019-02-14 03:10:30 +08:00
|
|
|
* Refresh chunk metadata when the free makes a block free or spans
|
|
|
|
* across blocks. The contig_hint may be off by up to a page, but if
|
|
|
|
* the contig_hint is contained in a block, it will be accurate with
|
|
|
|
* the else condition below.
|
2017-07-25 07:02:17 +08:00
|
|
|
*/
|
2019-02-14 03:10:30 +08:00
|
|
|
if (((end - start) >= PCPU_BITMAP_BLOCK_BITS) || s_index != e_index)
|
2019-02-27 02:46:48 +08:00
|
|
|
pcpu_chunk_refresh_hint(chunk, true);
|
2017-07-25 07:02:17 +08:00
|
|
|
else
|
2019-02-27 02:00:08 +08:00
|
|
|
pcpu_block_update(&chunk->chunk_md,
|
|
|
|
pcpu_block_off_to_off(s_index, start),
|
|
|
|
end);
|
2017-07-25 07:02:12 +08:00
|
|
|
}
|
|
|
|
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
/**
|
|
|
|
* pcpu_is_populated - determines if the region is populated
|
|
|
|
* @chunk: chunk of interest
|
|
|
|
* @bit_off: chunk offset
|
|
|
|
* @bits: size of area
|
|
|
|
* @next_off: return value for the next offset to start searching
|
|
|
|
*
|
|
|
|
* For atomic allocations, check if the backing pages are populated.
|
|
|
|
*
|
|
|
|
* RETURNS:
|
|
|
|
* Bool if the backing pages are populated.
|
|
|
|
* next_index is to skip over unpopulated blocks in pcpu_find_block_fit.
|
|
|
|
*/
|
|
|
|
static bool pcpu_is_populated(struct pcpu_chunk *chunk, int bit_off, int bits,
|
|
|
|
int *next_off)
|
|
|
|
{
|
2019-12-14 08:22:10 +08:00
|
|
|
unsigned int page_start, page_end, rs, re;
|
2009-11-11 14:35:18 +08:00
|
|
|
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
page_start = PFN_DOWN(bit_off * PCPU_MIN_ALLOC_SIZE);
|
|
|
|
page_end = PFN_UP((bit_off + bits) * PCPU_MIN_ALLOC_SIZE);
|
2009-11-11 14:35:18 +08:00
|
|
|
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
rs = page_start;
|
2019-12-14 08:22:10 +08:00
|
|
|
bitmap_next_clear_region(chunk->populated, &rs, &re, page_end);
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
if (rs >= page_end)
|
|
|
|
return true;
|
2009-11-11 14:35:18 +08:00
|
|
|
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
*next_off = re * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE;
|
|
|
|
return false;
|
2009-03-06 23:44:09 +08:00
|
|
|
}
|
|
|
|
|
2014-09-03 02:46:02 +08:00
|
|
|
/**
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
* pcpu_find_block_fit - finds the block index to start searching
|
|
|
|
* @chunk: chunk of interest
|
|
|
|
* @alloc_bits: size of request in allocation units
|
|
|
|
* @align: alignment of area (max PAGE_SIZE bytes)
|
|
|
|
* @pop_only: use populated regions only
|
|
|
|
*
|
2017-07-25 07:02:19 +08:00
|
|
|
* Given a chunk and an allocation spec, find the offset to begin searching
|
|
|
|
* for a free region. This iterates over the bitmap metadata blocks to
|
|
|
|
* find an offset that will be guaranteed to fit the requirements. It is
|
|
|
|
* not quite first fit as if the allocation does not fit in the contig hint
|
|
|
|
* of a block or chunk, it is skipped. This errs on the side of caution
|
|
|
|
* to prevent excess iteration. Poor alignment can cause the allocator to
|
|
|
|
* skip over blocks and chunks that have valid free areas.
|
|
|
|
*
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
* RETURNS:
|
|
|
|
* The offset in the bitmap to begin searching.
|
|
|
|
* -1 if no offset is found.
|
2014-09-03 02:46:02 +08:00
|
|
|
*/
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int alloc_bits,
|
|
|
|
size_t align, bool pop_only)
|
2014-09-03 02:46:02 +08:00
|
|
|
{
|
2019-02-27 02:00:08 +08:00
|
|
|
struct pcpu_block_md *chunk_md = &chunk->chunk_md;
|
2017-07-25 07:02:19 +08:00
|
|
|
int bit_off, bits, next_off;
|
2014-09-03 02:46:02 +08:00
|
|
|
|
2017-07-25 07:02:14 +08:00
|
|
|
/*
|
2021-04-08 11:57:35 +08:00
|
|
|
* This is an optimization to prevent scanning by assuming if the
|
|
|
|
* allocation cannot fit in the global hint, there is memory pressure
|
|
|
|
* and creating a new chunk would happen soon.
|
2017-07-25 07:02:14 +08:00
|
|
|
*/
|
2021-04-08 11:57:35 +08:00
|
|
|
if (!pcpu_check_block_hint(chunk_md, alloc_bits, align))
|
2017-07-25 07:02:14 +08:00
|
|
|
return -1;
|
|
|
|
|
2019-02-27 02:46:48 +08:00
|
|
|
bit_off = pcpu_next_hint(chunk_md, alloc_bits);
|
2017-07-25 07:02:19 +08:00
|
|
|
bits = 0;
|
|
|
|
pcpu_for_each_fit_region(chunk, alloc_bits, align, bit_off, bits) {
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
if (!pop_only || pcpu_is_populated(chunk, bit_off, bits,
|
2017-07-25 07:02:19 +08:00
|
|
|
&next_off))
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
break;
|
2014-09-03 02:46:02 +08:00
|
|
|
|
2017-07-25 07:02:19 +08:00
|
|
|
bit_off = next_off;
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
bits = 0;
|
2014-09-03 02:46:02 +08:00
|
|
|
}
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
|
|
|
|
if (bit_off == pcpu_chunk_map_bits(chunk))
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
return bit_off;
|
2014-09-03 02:46:02 +08:00
|
|
|
}
|
|
|
|
|
2019-02-23 01:03:16 +08:00
|
|
|
/*
|
|
|
|
* pcpu_find_zero_area - modified from bitmap_find_next_zero_area_off()
|
|
|
|
* @map: the address to base the search on
|
|
|
|
* @size: the bitmap size in bits
|
|
|
|
* @start: the bitnumber to start searching at
|
|
|
|
* @nr: the number of zeroed bits we're looking for
|
|
|
|
* @align_mask: alignment mask for zero area
|
|
|
|
* @largest_off: offset of the largest area skipped
|
|
|
|
* @largest_bits: size of the largest area skipped
|
|
|
|
*
|
|
|
|
* The @align_mask should be one less than a power of 2.
|
|
|
|
*
|
|
|
|
* This is a modified version of bitmap_find_next_zero_area_off() to remember
|
|
|
|
* the largest area that was skipped. This is imperfect, but in general is
|
|
|
|
* good enough. The largest remembered region is the largest failed region
|
|
|
|
* seen. This does not include anything we possibly skipped due to alignment.
|
|
|
|
* pcpu_block_update_scan() does scan backwards to try and recover what was
|
|
|
|
* lost to alignment. While this can cause scanning to miss earlier possible
|
|
|
|
* free areas, smaller allocations will eventually fill those holes.
|
|
|
|
*/
|
|
|
|
static unsigned long pcpu_find_zero_area(unsigned long *map,
|
|
|
|
unsigned long size,
|
|
|
|
unsigned long start,
|
|
|
|
unsigned long nr,
|
|
|
|
unsigned long align_mask,
|
|
|
|
unsigned long *largest_off,
|
|
|
|
unsigned long *largest_bits)
|
|
|
|
{
|
|
|
|
unsigned long index, end, i, area_off, area_bits;
|
|
|
|
again:
|
|
|
|
index = find_next_zero_bit(map, size, start);
|
|
|
|
|
|
|
|
/* Align allocation */
|
|
|
|
index = __ALIGN_MASK(index, align_mask);
|
|
|
|
area_off = index;
|
|
|
|
|
|
|
|
end = index + nr;
|
|
|
|
if (end > size)
|
|
|
|
return end;
|
|
|
|
i = find_next_bit(map, end, index);
|
|
|
|
if (i < end) {
|
|
|
|
area_bits = i - area_off;
|
|
|
|
/* remember largest unused area with best alignment */
|
|
|
|
if (area_bits > *largest_bits ||
|
|
|
|
(area_bits == *largest_bits && *largest_off &&
|
|
|
|
(!area_off || __ffs(area_off) > __ffs(*largest_off)))) {
|
|
|
|
*largest_off = area_off;
|
|
|
|
*largest_bits = area_bits;
|
|
|
|
}
|
|
|
|
|
|
|
|
start = i + 1;
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
return index;
|
|
|
|
}
|
|
|
|
|
2009-02-20 15:29:08 +08:00
|
|
|
/**
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
* pcpu_alloc_area - allocates an area from a pcpu_chunk
|
2009-02-20 15:29:08 +08:00
|
|
|
* @chunk: chunk of interest
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
* @alloc_bits: size of request in allocation units
|
|
|
|
* @align: alignment of area (max PAGE_SIZE)
|
|
|
|
* @start: bit_off to start searching
|
2009-03-06 23:44:09 +08:00
|
|
|
*
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
* This function takes in a @start offset to begin searching to fit an
|
2017-07-25 07:02:19 +08:00
|
|
|
* allocation of @alloc_bits with alignment @align. It needs to scan
|
|
|
|
* the allocation map because if it fits within the block's contig hint,
|
|
|
|
* @start will be block->first_free. This is an attempt to fill the
|
|
|
|
* allocation prior to breaking the contig hint. The allocation and
|
|
|
|
* boundary maps are updated accordingly if it confirms a valid
|
|
|
|
* free area.
|
2009-03-06 23:44:13 +08:00
|
|
|
*
|
2009-02-20 15:29:08 +08:00
|
|
|
* RETURNS:
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
* Allocated addr offset in @chunk on success.
|
|
|
|
* -1 if no matching area is found.
|
2009-02-20 15:29:08 +08:00
|
|
|
*/
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
static int pcpu_alloc_area(struct pcpu_chunk *chunk, int alloc_bits,
|
|
|
|
size_t align, int start)
|
2009-02-20 15:29:08 +08:00
|
|
|
{
|
2019-02-27 02:00:08 +08:00
|
|
|
struct pcpu_block_md *chunk_md = &chunk->chunk_md;
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
size_t align_mask = (align) ? (align - 1) : 0;
|
2019-02-23 01:03:16 +08:00
|
|
|
unsigned long area_off = 0, area_bits = 0;
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
int bit_off, end, oslot;
|
2014-09-03 02:46:02 +08:00
|
|
|
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
lockdep_assert_held(&pcpu_lock);
|
2009-02-20 15:29:08 +08:00
|
|
|
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
oslot = pcpu_chunk_slot(chunk);
|
2009-02-20 15:29:08 +08:00
|
|
|
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
/*
|
|
|
|
* Search to find a fit.
|
|
|
|
*/
|
2019-02-22 07:54:11 +08:00
|
|
|
end = min_t(int, start + alloc_bits + PCPU_BITMAP_BLOCK_BITS,
|
|
|
|
pcpu_chunk_map_bits(chunk));
|
2019-02-23 01:03:16 +08:00
|
|
|
bit_off = pcpu_find_zero_area(chunk->alloc_map, end, start, alloc_bits,
|
|
|
|
align_mask, &area_off, &area_bits);
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
if (bit_off >= end)
|
|
|
|
return -1;
|
2009-02-20 15:29:08 +08:00
|
|
|
|
2019-02-23 01:03:16 +08:00
|
|
|
if (area_bits)
|
|
|
|
pcpu_block_update_scan(chunk, area_off, area_bits);
|
|
|
|
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
/* update alloc map */
|
|
|
|
bitmap_set(chunk->alloc_map, bit_off, alloc_bits);
|
2014-03-07 09:52:32 +08:00
|
|
|
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
/* update boundary map */
|
|
|
|
set_bit(bit_off, chunk->bound_map);
|
|
|
|
bitmap_clear(chunk->bound_map, bit_off + 1, alloc_bits - 1);
|
|
|
|
set_bit(bit_off + alloc_bits, chunk->bound_map);
|
2009-02-20 15:29:08 +08:00
|
|
|
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
chunk->free_bytes -= alloc_bits * PCPU_MIN_ALLOC_SIZE;
|
2009-02-20 15:29:08 +08:00
|
|
|
|
2017-07-25 07:02:13 +08:00
|
|
|
/* update first free bit */
|
2019-02-27 02:00:08 +08:00
|
|
|
if (bit_off == chunk_md->first_free)
|
|
|
|
chunk_md->first_free = find_next_zero_bit(
|
2017-07-25 07:02:13 +08:00
|
|
|
chunk->alloc_map,
|
|
|
|
pcpu_chunk_map_bits(chunk),
|
|
|
|
bit_off + alloc_bits);
|
|
|
|
|
2017-07-25 07:02:12 +08:00
|
|
|
pcpu_block_update_hint_alloc(chunk, bit_off, alloc_bits);
|
2009-02-20 15:29:08 +08:00
|
|
|
|
|
|
|
pcpu_chunk_relocate(chunk, oslot);
|
|
|
|
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
return bit_off * PCPU_MIN_ALLOC_SIZE;
|
2009-02-20 15:29:08 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
* pcpu_free_area - frees the corresponding offset
|
2009-02-20 15:29:08 +08:00
|
|
|
* @chunk: chunk of interest
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
* @off: addr offset into chunk
|
2009-03-06 23:44:13 +08:00
|
|
|
*
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
* This function determines the size of an allocation to free using
|
|
|
|
* the boundary bitmap and clears the allocation map.
|
percpu: return number of released bytes from pcpu_free_area()
Patch series "mm: memcg accounting of percpu memory", v3.
This patchset adds percpu memory accounting to memory cgroups. It's based
on the rework of the slab controller and reuses concepts and features
introduced for the per-object slab accounting.
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
Percpu allocations by their nature are scattered over multiple pages, so
they can't be tracked on the per-page basis. So the per-object tracking
introduced by the new slab controller is reused.
The patchset implements charging of percpu allocations, adds memcg-level
statistics, enables accounting for percpu allocations made by memory
cgroup internals and provides some basic tests.
To implement the accounting of percpu memory without a significant memory
and performance overhead the following approach is used: all accounted
allocations are placed into a separate percpu chunk (or chunks). These
chunks are similar to default chunks, except that they do have an attached
vector of pointers to obj_cgroup objects, which is big enough to save a
pointer for each allocated object. On the allocation, if the allocation
has to be accounted (__GFP_ACCOUNT is passed, the allocating process
belongs to a non-root memory cgroup, etc), the memory cgroup is getting
charged and if the maximum limit is not exceeded the allocation is
performed using a memcg-aware chunk. Otherwise -ENOMEM is returned or the
allocation is forced over the limit, depending on gfp (as any other kernel
memory allocation). The memory cgroup information is saved in the
obj_cgroup vector at the corresponding offset. On the release time the
memcg information is restored from the vector and the cgroup is getting
uncharged. Unaccounted allocations (at this point the absolute majority
of all percpu allocations) are performed in the old way, so no additional
overhead is expected.
To avoid pinning dying memory cgroups by outstanding allocations,
obj_cgroup API is used instead of directly saving memory cgroup pointers.
obj_cgroup is basically a pointer to a memory cgroup with a standalone
reference counter. The trick is that it can be atomically swapped to
point at the parent cgroup, so that the original memory cgroup can be
released prior to all objects, which has been charged to it. Because all
charges and statistics are fully recursive, it's perfectly correct to
uncharge the parent cgroup instead. This scheme is used in the slab
memory accounting, and percpu memory can just follow the scheme.
This patch (of 5):
To implement accounting of percpu memory we need the information about the
size of freed object. Return it from pcpu_free_area().
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
cC: Michal Koutnýutny@suse.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-1-guro@fb.com
Link: http://lkml.kernel.org/r/20200608230819.832349-1-guro@fb.com
Link: http://lkml.kernel.org/r/20200608230819.832349-2-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:14 +08:00
|
|
|
*
|
|
|
|
* RETURNS:
|
|
|
|
* Number of freed bytes.
|
2009-02-20 15:29:08 +08:00
|
|
|
*/
|
percpu: return number of released bytes from pcpu_free_area()
Patch series "mm: memcg accounting of percpu memory", v3.
This patchset adds percpu memory accounting to memory cgroups. It's based
on the rework of the slab controller and reuses concepts and features
introduced for the per-object slab accounting.
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
Percpu allocations by their nature are scattered over multiple pages, so
they can't be tracked on the per-page basis. So the per-object tracking
introduced by the new slab controller is reused.
The patchset implements charging of percpu allocations, adds memcg-level
statistics, enables accounting for percpu allocations made by memory
cgroup internals and provides some basic tests.
To implement the accounting of percpu memory without a significant memory
and performance overhead the following approach is used: all accounted
allocations are placed into a separate percpu chunk (or chunks). These
chunks are similar to default chunks, except that they do have an attached
vector of pointers to obj_cgroup objects, which is big enough to save a
pointer for each allocated object. On the allocation, if the allocation
has to be accounted (__GFP_ACCOUNT is passed, the allocating process
belongs to a non-root memory cgroup, etc), the memory cgroup is getting
charged and if the maximum limit is not exceeded the allocation is
performed using a memcg-aware chunk. Otherwise -ENOMEM is returned or the
allocation is forced over the limit, depending on gfp (as any other kernel
memory allocation). The memory cgroup information is saved in the
obj_cgroup vector at the corresponding offset. On the release time the
memcg information is restored from the vector and the cgroup is getting
uncharged. Unaccounted allocations (at this point the absolute majority
of all percpu allocations) are performed in the old way, so no additional
overhead is expected.
To avoid pinning dying memory cgroups by outstanding allocations,
obj_cgroup API is used instead of directly saving memory cgroup pointers.
obj_cgroup is basically a pointer to a memory cgroup with a standalone
reference counter. The trick is that it can be atomically swapped to
point at the parent cgroup, so that the original memory cgroup can be
released prior to all objects, which has been charged to it. Because all
charges and statistics are fully recursive, it's perfectly correct to
uncharge the parent cgroup instead. This scheme is used in the slab
memory accounting, and percpu memory can just follow the scheme.
This patch (of 5):
To implement accounting of percpu memory we need the information about the
size of freed object. Return it from pcpu_free_area().
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
cC: Michal Koutnýutny@suse.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-1-guro@fb.com
Link: http://lkml.kernel.org/r/20200608230819.832349-1-guro@fb.com
Link: http://lkml.kernel.org/r/20200608230819.832349-2-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:14 +08:00
|
|
|
static int pcpu_free_area(struct pcpu_chunk *chunk, int off)
|
2009-02-20 15:29:08 +08:00
|
|
|
{
|
2019-02-27 02:00:08 +08:00
|
|
|
struct pcpu_block_md *chunk_md = &chunk->chunk_md;
|
percpu: return number of released bytes from pcpu_free_area()
Patch series "mm: memcg accounting of percpu memory", v3.
This patchset adds percpu memory accounting to memory cgroups. It's based
on the rework of the slab controller and reuses concepts and features
introduced for the per-object slab accounting.
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
Percpu allocations by their nature are scattered over multiple pages, so
they can't be tracked on the per-page basis. So the per-object tracking
introduced by the new slab controller is reused.
The patchset implements charging of percpu allocations, adds memcg-level
statistics, enables accounting for percpu allocations made by memory
cgroup internals and provides some basic tests.
To implement the accounting of percpu memory without a significant memory
and performance overhead the following approach is used: all accounted
allocations are placed into a separate percpu chunk (or chunks). These
chunks are similar to default chunks, except that they do have an attached
vector of pointers to obj_cgroup objects, which is big enough to save a
pointer for each allocated object. On the allocation, if the allocation
has to be accounted (__GFP_ACCOUNT is passed, the allocating process
belongs to a non-root memory cgroup, etc), the memory cgroup is getting
charged and if the maximum limit is not exceeded the allocation is
performed using a memcg-aware chunk. Otherwise -ENOMEM is returned or the
allocation is forced over the limit, depending on gfp (as any other kernel
memory allocation). The memory cgroup information is saved in the
obj_cgroup vector at the corresponding offset. On the release time the
memcg information is restored from the vector and the cgroup is getting
uncharged. Unaccounted allocations (at this point the absolute majority
of all percpu allocations) are performed in the old way, so no additional
overhead is expected.
To avoid pinning dying memory cgroups by outstanding allocations,
obj_cgroup API is used instead of directly saving memory cgroup pointers.
obj_cgroup is basically a pointer to a memory cgroup with a standalone
reference counter. The trick is that it can be atomically swapped to
point at the parent cgroup, so that the original memory cgroup can be
released prior to all objects, which has been charged to it. Because all
charges and statistics are fully recursive, it's perfectly correct to
uncharge the parent cgroup instead. This scheme is used in the slab
memory accounting, and percpu memory can just follow the scheme.
This patch (of 5):
To implement accounting of percpu memory we need the information about the
size of freed object. Return it from pcpu_free_area().
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
cC: Michal Koutnýutny@suse.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-1-guro@fb.com
Link: http://lkml.kernel.org/r/20200608230819.832349-1-guro@fb.com
Link: http://lkml.kernel.org/r/20200608230819.832349-2-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:14 +08:00
|
|
|
int bit_off, bits, end, oslot, freed;
|
percpu: store offsets instead of lengths in ->map[]
Current code keeps +-length for each area in chunk->map[]. It has
several unpleasant consequences:
* even if we know that first 50 areas are all in use, allocation
still needs to go through all those areas just to sum their sizes, just
to get the offset of free one.
* freeing needs to find the array entry refering to the area
in question; again, the need to sum the sizes until we reach the offset
we are interested in. Note that offsets are monotonous, so simple
binary search would do here.
New data representation: array of <offset,in-use flag> pairs.
Each pair is represented by one int - we use offset|1 for <offset, in use>
and offset for <offset, free> (we make sure that all offsets are even).
In the end we put a sentry entry - <total size, in use>. The first
entry is <0, flag>; it would be possible to store together the flag
for Nth area and offset for N+1st, but that leads to much hairier code.
In other words, where the old variant would have
4, -8, -4, 4, -12, 100
(4 bytes free, 8 in use, 4 in use, 4 free, 12 in use, 100 free) we store
<0,0>, <4,1>, <12,1>, <16,0>, <20,1>, <32,0>, <132,1>
i.e.
0, 5, 13, 16, 21, 32, 133
This commit switches to new data representation and takes care of a couple
of low-hanging fruits in free_pcpu_area() - one is the switch to binary
search, another is not doing two memmove() when one would do. Speeding
the alloc side up (by keeping track of how many areas in the beginning are
known to be all in use) also becomes possible - that'll be done in the next
commit.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-03-07 10:13:18 +08:00
|
|
|
|
2017-06-20 07:28:29 +08:00
|
|
|
lockdep_assert_held(&pcpu_lock);
|
2017-06-20 07:28:31 +08:00
|
|
|
pcpu_stats_area_dealloc(chunk);
|
2017-06-20 07:28:29 +08:00
|
|
|
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
oslot = pcpu_chunk_slot(chunk);
|
2009-02-20 15:29:08 +08:00
|
|
|
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
bit_off = off / PCPU_MIN_ALLOC_SIZE;
|
2014-03-07 09:52:32 +08:00
|
|
|
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
/* find end index */
|
|
|
|
end = find_next_bit(chunk->bound_map, pcpu_chunk_map_bits(chunk),
|
|
|
|
bit_off + 1);
|
|
|
|
bits = end - bit_off;
|
|
|
|
bitmap_clear(chunk->alloc_map, bit_off, bits);
|
2009-02-20 15:29:08 +08:00
|
|
|
|
percpu: return number of released bytes from pcpu_free_area()
Patch series "mm: memcg accounting of percpu memory", v3.
This patchset adds percpu memory accounting to memory cgroups. It's based
on the rework of the slab controller and reuses concepts and features
introduced for the per-object slab accounting.
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
Percpu allocations by their nature are scattered over multiple pages, so
they can't be tracked on the per-page basis. So the per-object tracking
introduced by the new slab controller is reused.
The patchset implements charging of percpu allocations, adds memcg-level
statistics, enables accounting for percpu allocations made by memory
cgroup internals and provides some basic tests.
To implement the accounting of percpu memory without a significant memory
and performance overhead the following approach is used: all accounted
allocations are placed into a separate percpu chunk (or chunks). These
chunks are similar to default chunks, except that they do have an attached
vector of pointers to obj_cgroup objects, which is big enough to save a
pointer for each allocated object. On the allocation, if the allocation
has to be accounted (__GFP_ACCOUNT is passed, the allocating process
belongs to a non-root memory cgroup, etc), the memory cgroup is getting
charged and if the maximum limit is not exceeded the allocation is
performed using a memcg-aware chunk. Otherwise -ENOMEM is returned or the
allocation is forced over the limit, depending on gfp (as any other kernel
memory allocation). The memory cgroup information is saved in the
obj_cgroup vector at the corresponding offset. On the release time the
memcg information is restored from the vector and the cgroup is getting
uncharged. Unaccounted allocations (at this point the absolute majority
of all percpu allocations) are performed in the old way, so no additional
overhead is expected.
To avoid pinning dying memory cgroups by outstanding allocations,
obj_cgroup API is used instead of directly saving memory cgroup pointers.
obj_cgroup is basically a pointer to a memory cgroup with a standalone
reference counter. The trick is that it can be atomically swapped to
point at the parent cgroup, so that the original memory cgroup can be
released prior to all objects, which has been charged to it. Because all
charges and statistics are fully recursive, it's perfectly correct to
uncharge the parent cgroup instead. This scheme is used in the slab
memory accounting, and percpu memory can just follow the scheme.
This patch (of 5):
To implement accounting of percpu memory we need the information about the
size of freed object. Return it from pcpu_free_area().
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
cC: Michal Koutnýutny@suse.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-1-guro@fb.com
Link: http://lkml.kernel.org/r/20200608230819.832349-1-guro@fb.com
Link: http://lkml.kernel.org/r/20200608230819.832349-2-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:14 +08:00
|
|
|
freed = bits * PCPU_MIN_ALLOC_SIZE;
|
|
|
|
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
/* update metadata */
|
percpu: return number of released bytes from pcpu_free_area()
Patch series "mm: memcg accounting of percpu memory", v3.
This patchset adds percpu memory accounting to memory cgroups. It's based
on the rework of the slab controller and reuses concepts and features
introduced for the per-object slab accounting.
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
Percpu allocations by their nature are scattered over multiple pages, so
they can't be tracked on the per-page basis. So the per-object tracking
introduced by the new slab controller is reused.
The patchset implements charging of percpu allocations, adds memcg-level
statistics, enables accounting for percpu allocations made by memory
cgroup internals and provides some basic tests.
To implement the accounting of percpu memory without a significant memory
and performance overhead the following approach is used: all accounted
allocations are placed into a separate percpu chunk (or chunks). These
chunks are similar to default chunks, except that they do have an attached
vector of pointers to obj_cgroup objects, which is big enough to save a
pointer for each allocated object. On the allocation, if the allocation
has to be accounted (__GFP_ACCOUNT is passed, the allocating process
belongs to a non-root memory cgroup, etc), the memory cgroup is getting
charged and if the maximum limit is not exceeded the allocation is
performed using a memcg-aware chunk. Otherwise -ENOMEM is returned or the
allocation is forced over the limit, depending on gfp (as any other kernel
memory allocation). The memory cgroup information is saved in the
obj_cgroup vector at the corresponding offset. On the release time the
memcg information is restored from the vector and the cgroup is getting
uncharged. Unaccounted allocations (at this point the absolute majority
of all percpu allocations) are performed in the old way, so no additional
overhead is expected.
To avoid pinning dying memory cgroups by outstanding allocations,
obj_cgroup API is used instead of directly saving memory cgroup pointers.
obj_cgroup is basically a pointer to a memory cgroup with a standalone
reference counter. The trick is that it can be atomically swapped to
point at the parent cgroup, so that the original memory cgroup can be
released prior to all objects, which has been charged to it. Because all
charges and statistics are fully recursive, it's perfectly correct to
uncharge the parent cgroup instead. This scheme is used in the slab
memory accounting, and percpu memory can just follow the scheme.
This patch (of 5):
To implement accounting of percpu memory we need the information about the
size of freed object. Return it from pcpu_free_area().
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
cC: Michal Koutnýutny@suse.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-1-guro@fb.com
Link: http://lkml.kernel.org/r/20200608230819.832349-1-guro@fb.com
Link: http://lkml.kernel.org/r/20200608230819.832349-2-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:14 +08:00
|
|
|
chunk->free_bytes += freed;
|
2014-09-03 02:46:05 +08:00
|
|
|
|
2017-07-25 07:02:13 +08:00
|
|
|
/* update first free bit */
|
2019-02-27 02:00:08 +08:00
|
|
|
chunk_md->first_free = min(chunk_md->first_free, bit_off);
|
2017-07-25 07:02:13 +08:00
|
|
|
|
2017-07-25 07:02:12 +08:00
|
|
|
pcpu_block_update_hint_free(chunk, bit_off, bits);
|
2009-02-20 15:29:08 +08:00
|
|
|
|
|
|
|
pcpu_chunk_relocate(chunk, oslot);
|
percpu: return number of released bytes from pcpu_free_area()
Patch series "mm: memcg accounting of percpu memory", v3.
This patchset adds percpu memory accounting to memory cgroups. It's based
on the rework of the slab controller and reuses concepts and features
introduced for the per-object slab accounting.
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
Percpu allocations by their nature are scattered over multiple pages, so
they can't be tracked on the per-page basis. So the per-object tracking
introduced by the new slab controller is reused.
The patchset implements charging of percpu allocations, adds memcg-level
statistics, enables accounting for percpu allocations made by memory
cgroup internals and provides some basic tests.
To implement the accounting of percpu memory without a significant memory
and performance overhead the following approach is used: all accounted
allocations are placed into a separate percpu chunk (or chunks). These
chunks are similar to default chunks, except that they do have an attached
vector of pointers to obj_cgroup objects, which is big enough to save a
pointer for each allocated object. On the allocation, if the allocation
has to be accounted (__GFP_ACCOUNT is passed, the allocating process
belongs to a non-root memory cgroup, etc), the memory cgroup is getting
charged and if the maximum limit is not exceeded the allocation is
performed using a memcg-aware chunk. Otherwise -ENOMEM is returned or the
allocation is forced over the limit, depending on gfp (as any other kernel
memory allocation). The memory cgroup information is saved in the
obj_cgroup vector at the corresponding offset. On the release time the
memcg information is restored from the vector and the cgroup is getting
uncharged. Unaccounted allocations (at this point the absolute majority
of all percpu allocations) are performed in the old way, so no additional
overhead is expected.
To avoid pinning dying memory cgroups by outstanding allocations,
obj_cgroup API is used instead of directly saving memory cgroup pointers.
obj_cgroup is basically a pointer to a memory cgroup with a standalone
reference counter. The trick is that it can be atomically swapped to
point at the parent cgroup, so that the original memory cgroup can be
released prior to all objects, which has been charged to it. Because all
charges and statistics are fully recursive, it's perfectly correct to
uncharge the parent cgroup instead. This scheme is used in the slab
memory accounting, and percpu memory can just follow the scheme.
This patch (of 5):
To implement accounting of percpu memory we need the information about the
size of freed object. Return it from pcpu_free_area().
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
cC: Michal Koutnýutny@suse.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-1-guro@fb.com
Link: http://lkml.kernel.org/r/20200608230819.832349-1-guro@fb.com
Link: http://lkml.kernel.org/r/20200608230819.832349-2-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:14 +08:00
|
|
|
|
|
|
|
return freed;
|
2009-02-20 15:29:08 +08:00
|
|
|
}
|
|
|
|
|
2019-02-27 01:56:16 +08:00
|
|
|
static void pcpu_init_md_block(struct pcpu_block_md *block, int nr_bits)
|
|
|
|
{
|
|
|
|
block->scan_hint = 0;
|
|
|
|
block->contig_hint = nr_bits;
|
|
|
|
block->left_free = nr_bits;
|
|
|
|
block->right_free = nr_bits;
|
|
|
|
block->first_free = 0;
|
|
|
|
block->nr_bits = nr_bits;
|
|
|
|
}
|
|
|
|
|
2017-07-25 07:02:12 +08:00
|
|
|
static void pcpu_init_md_blocks(struct pcpu_chunk *chunk)
|
|
|
|
{
|
|
|
|
struct pcpu_block_md *md_block;
|
|
|
|
|
2019-02-27 02:00:08 +08:00
|
|
|
/* init the chunk's block */
|
|
|
|
pcpu_init_md_block(&chunk->chunk_md, pcpu_chunk_map_bits(chunk));
|
|
|
|
|
2017-07-25 07:02:12 +08:00
|
|
|
for (md_block = chunk->md_blocks;
|
|
|
|
md_block != chunk->md_blocks + pcpu_chunk_nr_blocks(chunk);
|
2019-02-27 01:56:16 +08:00
|
|
|
md_block++)
|
|
|
|
pcpu_init_md_block(md_block, PCPU_BITMAP_BLOCK_BITS);
|
2017-07-25 07:02:12 +08:00
|
|
|
}
|
|
|
|
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
/**
|
|
|
|
* pcpu_alloc_first_chunk - creates chunks that serve the first chunk
|
|
|
|
* @tmp_addr: the start of the region served
|
|
|
|
* @map_size: size of the region served
|
|
|
|
*
|
|
|
|
* This is responsible for creating the chunks that serve the first chunk. The
|
|
|
|
* base_addr is page aligned down of @tmp_addr while the region end is page
|
|
|
|
* aligned up. Offsets are kept track of to determine the region served. All
|
|
|
|
* this is done to appease the bitmap allocator in avoiding partial blocks.
|
|
|
|
*
|
|
|
|
* RETURNS:
|
|
|
|
* Chunk serving the region at @tmp_addr of @map_size.
|
|
|
|
*/
|
2017-07-25 07:02:05 +08:00
|
|
|
static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
int map_size)
|
2017-07-25 07:02:02 +08:00
|
|
|
{
|
|
|
|
struct pcpu_chunk *chunk;
|
2017-07-25 07:02:12 +08:00
|
|
|
unsigned long aligned_addr, lcm_align;
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
int start_offset, offset_bits, region_size, region_bits;
|
2019-03-12 14:30:15 +08:00
|
|
|
size_t alloc_size;
|
2017-07-25 07:02:05 +08:00
|
|
|
|
|
|
|
/* region calculations */
|
|
|
|
aligned_addr = tmp_addr & PAGE_MASK;
|
|
|
|
|
|
|
|
start_offset = tmp_addr - aligned_addr;
|
2017-07-25 07:02:03 +08:00
|
|
|
|
2017-07-25 07:02:12 +08:00
|
|
|
/*
|
|
|
|
* Align the end of the region with the LCM of PAGE_SIZE and
|
|
|
|
* PCPU_BITMAP_BLOCK_SIZE. One of these constants is a multiple of
|
|
|
|
* the other.
|
|
|
|
*/
|
|
|
|
lcm_align = lcm(PAGE_SIZE, PCPU_BITMAP_BLOCK_SIZE);
|
|
|
|
region_size = ALIGN(start_offset + map_size, lcm_align);
|
2017-07-25 07:02:02 +08:00
|
|
|
|
2017-07-25 07:02:05 +08:00
|
|
|
/* allocate chunk */
|
2020-10-31 04:40:21 +08:00
|
|
|
alloc_size = struct_size(chunk, populated,
|
|
|
|
BITS_TO_LONGS(region_size >> PAGE_SHIFT));
|
2019-03-12 14:30:15 +08:00
|
|
|
chunk = memblock_alloc(alloc_size, SMP_CACHE_BYTES);
|
|
|
|
if (!chunk)
|
|
|
|
panic("%s: Failed to allocate %zu bytes\n", __func__,
|
|
|
|
alloc_size);
|
2017-07-25 07:02:05 +08:00
|
|
|
|
2017-07-25 07:02:02 +08:00
|
|
|
INIT_LIST_HEAD(&chunk->list);
|
2017-07-25 07:02:05 +08:00
|
|
|
|
|
|
|
chunk->base_addr = (void *)aligned_addr;
|
2017-07-25 07:02:02 +08:00
|
|
|
chunk->start_offset = start_offset;
|
2017-07-25 07:02:03 +08:00
|
|
|
chunk->end_offset = region_size - chunk->start_offset - map_size;
|
2017-07-25 07:02:05 +08:00
|
|
|
|
2017-07-25 07:02:07 +08:00
|
|
|
chunk->nr_pages = region_size >> PAGE_SHIFT;
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
region_bits = pcpu_chunk_map_bits(chunk);
|
2017-07-25 07:02:05 +08:00
|
|
|
|
2019-03-12 14:30:15 +08:00
|
|
|
alloc_size = BITS_TO_LONGS(region_bits) * sizeof(chunk->alloc_map[0]);
|
|
|
|
chunk->alloc_map = memblock_alloc(alloc_size, SMP_CACHE_BYTES);
|
|
|
|
if (!chunk->alloc_map)
|
|
|
|
panic("%s: Failed to allocate %zu bytes\n", __func__,
|
|
|
|
alloc_size);
|
|
|
|
|
|
|
|
alloc_size =
|
|
|
|
BITS_TO_LONGS(region_bits + 1) * sizeof(chunk->bound_map[0]);
|
|
|
|
chunk->bound_map = memblock_alloc(alloc_size, SMP_CACHE_BYTES);
|
|
|
|
if (!chunk->bound_map)
|
|
|
|
panic("%s: Failed to allocate %zu bytes\n", __func__,
|
|
|
|
alloc_size);
|
|
|
|
|
|
|
|
alloc_size = pcpu_chunk_nr_blocks(chunk) * sizeof(chunk->md_blocks[0]);
|
|
|
|
chunk->md_blocks = memblock_alloc(alloc_size, SMP_CACHE_BYTES);
|
|
|
|
if (!chunk->md_blocks)
|
|
|
|
panic("%s: Failed to allocate %zu bytes\n", __func__,
|
|
|
|
alloc_size);
|
|
|
|
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
#ifdef CONFIG_MEMCG_KMEM
|
2021-06-03 09:09:31 +08:00
|
|
|
/* first chunk is free to use */
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
chunk->obj_cgroups = NULL;
|
|
|
|
#endif
|
2017-07-25 07:02:12 +08:00
|
|
|
pcpu_init_md_blocks(chunk);
|
2017-07-25 07:02:02 +08:00
|
|
|
|
|
|
|
/* manage populated page bitmap */
|
|
|
|
chunk->immutable = true;
|
2017-07-25 07:02:07 +08:00
|
|
|
bitmap_fill(chunk->populated, chunk->nr_pages);
|
|
|
|
chunk->nr_populated = chunk->nr_pages;
|
2019-02-14 03:10:30 +08:00
|
|
|
chunk->nr_empty_pop_pages = chunk->nr_pages;
|
2017-07-25 07:02:02 +08:00
|
|
|
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
chunk->free_bytes = map_size;
|
2017-07-25 07:02:05 +08:00
|
|
|
|
|
|
|
if (chunk->start_offset) {
|
|
|
|
/* hide the beginning of the bitmap */
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
offset_bits = chunk->start_offset / PCPU_MIN_ALLOC_SIZE;
|
|
|
|
bitmap_set(chunk->alloc_map, 0, offset_bits);
|
|
|
|
set_bit(0, chunk->bound_map);
|
|
|
|
set_bit(offset_bits, chunk->bound_map);
|
2017-07-25 07:02:12 +08:00
|
|
|
|
2019-02-27 02:00:08 +08:00
|
|
|
chunk->chunk_md.first_free = offset_bits;
|
2017-07-25 07:02:13 +08:00
|
|
|
|
2017-07-25 07:02:12 +08:00
|
|
|
pcpu_block_update_hint_alloc(chunk, 0, offset_bits);
|
2017-07-25 07:02:05 +08:00
|
|
|
}
|
|
|
|
|
2017-07-25 07:02:03 +08:00
|
|
|
if (chunk->end_offset) {
|
|
|
|
/* hide the end of the bitmap */
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
offset_bits = chunk->end_offset / PCPU_MIN_ALLOC_SIZE;
|
|
|
|
bitmap_set(chunk->alloc_map,
|
|
|
|
pcpu_chunk_map_bits(chunk) - offset_bits,
|
|
|
|
offset_bits);
|
|
|
|
set_bit((start_offset + map_size) / PCPU_MIN_ALLOC_SIZE,
|
|
|
|
chunk->bound_map);
|
|
|
|
set_bit(region_bits, chunk->bound_map);
|
2017-07-25 07:02:03 +08:00
|
|
|
|
2017-07-25 07:02:12 +08:00
|
|
|
pcpu_block_update_hint_alloc(chunk, pcpu_chunk_map_bits(chunk)
|
|
|
|
- offset_bits, offset_bits);
|
|
|
|
}
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
|
2017-07-25 07:02:02 +08:00
|
|
|
return chunk;
|
|
|
|
}
|
|
|
|
|
2021-06-03 09:09:31 +08:00
|
|
|
static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
|
2010-04-09 17:57:01 +08:00
|
|
|
{
|
|
|
|
struct pcpu_chunk *chunk;
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
int region_bits;
|
2010-04-09 17:57:01 +08:00
|
|
|
|
2018-02-17 02:07:19 +08:00
|
|
|
chunk = pcpu_mem_zalloc(pcpu_chunk_struct_size, gfp);
|
2010-04-09 17:57:01 +08:00
|
|
|
if (!chunk)
|
|
|
|
return NULL;
|
|
|
|
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
INIT_LIST_HEAD(&chunk->list);
|
|
|
|
chunk->nr_pages = pcpu_unit_pages;
|
|
|
|
region_bits = pcpu_chunk_map_bits(chunk);
|
2010-04-09 17:57:01 +08:00
|
|
|
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
chunk->alloc_map = pcpu_mem_zalloc(BITS_TO_LONGS(region_bits) *
|
2018-02-17 02:07:19 +08:00
|
|
|
sizeof(chunk->alloc_map[0]), gfp);
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
if (!chunk->alloc_map)
|
|
|
|
goto alloc_map_fail;
|
2010-04-09 17:57:01 +08:00
|
|
|
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
chunk->bound_map = pcpu_mem_zalloc(BITS_TO_LONGS(region_bits + 1) *
|
2018-02-17 02:07:19 +08:00
|
|
|
sizeof(chunk->bound_map[0]), gfp);
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
if (!chunk->bound_map)
|
|
|
|
goto bound_map_fail;
|
2010-04-09 17:57:01 +08:00
|
|
|
|
2017-07-25 07:02:12 +08:00
|
|
|
chunk->md_blocks = pcpu_mem_zalloc(pcpu_chunk_nr_blocks(chunk) *
|
2018-02-17 02:07:19 +08:00
|
|
|
sizeof(chunk->md_blocks[0]), gfp);
|
2017-07-25 07:02:12 +08:00
|
|
|
if (!chunk->md_blocks)
|
|
|
|
goto md_blocks_fail;
|
|
|
|
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
#ifdef CONFIG_MEMCG_KMEM
|
2021-06-03 09:09:31 +08:00
|
|
|
if (!mem_cgroup_kmem_disabled()) {
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
chunk->obj_cgroups =
|
|
|
|
pcpu_mem_zalloc(pcpu_chunk_map_bits(chunk) *
|
|
|
|
sizeof(struct obj_cgroup *), gfp);
|
|
|
|
if (!chunk->obj_cgroups)
|
|
|
|
goto objcg_fail;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2017-07-25 07:02:12 +08:00
|
|
|
pcpu_init_md_blocks(chunk);
|
|
|
|
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
/* init metadata */
|
|
|
|
chunk->free_bytes = chunk->nr_pages * PAGE_SIZE;
|
2017-07-25 07:02:05 +08:00
|
|
|
|
2010-04-09 17:57:01 +08:00
|
|
|
return chunk;
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
#ifdef CONFIG_MEMCG_KMEM
|
|
|
|
objcg_fail:
|
|
|
|
pcpu_mem_free(chunk->md_blocks);
|
|
|
|
#endif
|
2017-07-25 07:02:12 +08:00
|
|
|
md_blocks_fail:
|
|
|
|
pcpu_mem_free(chunk->bound_map);
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
bound_map_fail:
|
|
|
|
pcpu_mem_free(chunk->alloc_map);
|
|
|
|
alloc_map_fail:
|
|
|
|
pcpu_mem_free(chunk);
|
|
|
|
|
|
|
|
return NULL;
|
2010-04-09 17:57:01 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void pcpu_free_chunk(struct pcpu_chunk *chunk)
|
|
|
|
{
|
|
|
|
if (!chunk)
|
|
|
|
return;
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
#ifdef CONFIG_MEMCG_KMEM
|
|
|
|
pcpu_mem_free(chunk->obj_cgroups);
|
|
|
|
#endif
|
2018-10-07 16:31:51 +08:00
|
|
|
pcpu_mem_free(chunk->md_blocks);
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
pcpu_mem_free(chunk->bound_map);
|
|
|
|
pcpu_mem_free(chunk->alloc_map);
|
2016-01-23 07:11:02 +08:00
|
|
|
pcpu_mem_free(chunk);
|
2010-04-09 17:57:01 +08:00
|
|
|
}
|
|
|
|
|
2014-09-03 02:46:05 +08:00
|
|
|
/**
|
|
|
|
* pcpu_chunk_populated - post-population bookkeeping
|
|
|
|
* @chunk: pcpu_chunk which got populated
|
|
|
|
* @page_start: the start page
|
|
|
|
* @page_end: the end page
|
|
|
|
*
|
|
|
|
* Pages in [@page_start,@page_end) have been populated to @chunk. Update
|
|
|
|
* the bookkeeping information accordingly. Must be called after each
|
|
|
|
* successful population.
|
|
|
|
*/
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
static void pcpu_chunk_populated(struct pcpu_chunk *chunk, int page_start,
|
2019-02-14 03:10:30 +08:00
|
|
|
int page_end)
|
2014-09-03 02:46:05 +08:00
|
|
|
{
|
|
|
|
int nr = page_end - page_start;
|
|
|
|
|
|
|
|
lockdep_assert_held(&pcpu_lock);
|
|
|
|
|
|
|
|
bitmap_set(chunk->populated, page_start, nr);
|
|
|
|
chunk->nr_populated += nr;
|
2018-08-22 12:53:58 +08:00
|
|
|
pcpu_nr_populated += nr;
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
|
2019-02-14 03:10:30 +08:00
|
|
|
pcpu_update_empty_pages(chunk, nr);
|
2014-09-03 02:46:05 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* pcpu_chunk_depopulated - post-depopulation bookkeeping
|
|
|
|
* @chunk: pcpu_chunk which got depopulated
|
|
|
|
* @page_start: the start page
|
|
|
|
* @page_end: the end page
|
|
|
|
*
|
|
|
|
* Pages in [@page_start,@page_end) have been depopulated from @chunk.
|
|
|
|
* Update the bookkeeping information accordingly. Must be called after
|
|
|
|
* each successful depopulation.
|
|
|
|
*/
|
|
|
|
static void pcpu_chunk_depopulated(struct pcpu_chunk *chunk,
|
|
|
|
int page_start, int page_end)
|
|
|
|
{
|
|
|
|
int nr = page_end - page_start;
|
|
|
|
|
|
|
|
lockdep_assert_held(&pcpu_lock);
|
|
|
|
|
|
|
|
bitmap_clear(chunk->populated, page_start, nr);
|
|
|
|
chunk->nr_populated -= nr;
|
2018-08-22 12:53:58 +08:00
|
|
|
pcpu_nr_populated -= nr;
|
2019-02-14 03:10:30 +08:00
|
|
|
|
|
|
|
pcpu_update_empty_pages(chunk, -nr);
|
2014-09-03 02:46:05 +08:00
|
|
|
}
|
|
|
|
|
2010-04-09 17:57:01 +08:00
|
|
|
/*
|
|
|
|
* Chunk management implementation.
|
|
|
|
*
|
|
|
|
* To allow different implementations, chunk alloc/free and
|
|
|
|
* [de]population are implemented in a separate file which is pulled
|
|
|
|
* into this file and compiled together. The following functions
|
|
|
|
* should be implemented.
|
|
|
|
*
|
|
|
|
* pcpu_populate_chunk - populate the specified range of a chunk
|
|
|
|
* pcpu_depopulate_chunk - depopulate the specified range of a chunk
|
2021-07-03 11:49:57 +08:00
|
|
|
* pcpu_post_unmap_tlb_flush - flush tlb for the specified range of a chunk
|
2010-04-09 17:57:01 +08:00
|
|
|
* pcpu_create_chunk - create a new chunk
|
|
|
|
* pcpu_destroy_chunk - destroy a chunk, always preceded by full depop
|
|
|
|
* pcpu_addr_to_page - translate address to physical address
|
|
|
|
* pcpu_verify_alloc_info - check alloc_info is acceptable during init
|
2009-02-20 15:29:08 +08:00
|
|
|
*/
|
2018-02-16 00:08:14 +08:00
|
|
|
static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
|
2018-02-17 02:07:19 +08:00
|
|
|
int page_start, int page_end, gfp_t gfp);
|
2018-02-16 00:08:14 +08:00
|
|
|
static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
|
|
|
|
int page_start, int page_end);
|
2021-07-03 11:49:57 +08:00
|
|
|
static void pcpu_post_unmap_tlb_flush(struct pcpu_chunk *chunk,
|
|
|
|
int page_start, int page_end);
|
2021-06-03 09:09:31 +08:00
|
|
|
static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp);
|
2010-04-09 17:57:01 +08:00
|
|
|
static void pcpu_destroy_chunk(struct pcpu_chunk *chunk);
|
|
|
|
static struct page *pcpu_addr_to_page(void *addr);
|
|
|
|
static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai);
|
2009-02-20 15:29:08 +08:00
|
|
|
|
2010-04-09 17:57:01 +08:00
|
|
|
#ifdef CONFIG_NEED_PER_CPU_KM
|
|
|
|
#include "percpu-km.c"
|
|
|
|
#else
|
2010-04-09 17:57:01 +08:00
|
|
|
#include "percpu-vm.c"
|
2010-04-09 17:57:01 +08:00
|
|
|
#endif
|
2009-02-20 15:29:08 +08:00
|
|
|
|
2010-04-09 17:57:01 +08:00
|
|
|
/**
|
|
|
|
* pcpu_chunk_addr_search - determine chunk containing specified address
|
|
|
|
* @addr: address for which the chunk needs to be determined.
|
|
|
|
*
|
2017-07-25 07:02:05 +08:00
|
|
|
* This is an internal function that handles all but static allocations.
|
|
|
|
* Static percpu address values should never be passed into the allocator.
|
|
|
|
*
|
2010-04-09 17:57:01 +08:00
|
|
|
* RETURNS:
|
|
|
|
* The address of the found chunk.
|
|
|
|
*/
|
|
|
|
static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
|
|
|
|
{
|
2017-07-25 07:02:05 +08:00
|
|
|
/* is it in the dynamic region (first chunk)? */
|
2017-07-25 07:02:06 +08:00
|
|
|
if (pcpu_addr_in_chunk(pcpu_first_chunk, addr))
|
2010-04-09 17:57:01 +08:00
|
|
|
return pcpu_first_chunk;
|
2017-07-25 07:02:05 +08:00
|
|
|
|
|
|
|
/* is it in the reserved region? */
|
2017-07-25 07:02:06 +08:00
|
|
|
if (pcpu_addr_in_chunk(pcpu_reserved_chunk, addr))
|
2017-07-25 07:02:05 +08:00
|
|
|
return pcpu_reserved_chunk;
|
2010-04-09 17:57:01 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The address is relative to unit0 which might be unused and
|
|
|
|
* thus unmapped. Offset the address to the unit space of the
|
|
|
|
* current processor before looking it up in the vmalloc
|
|
|
|
* space. Note that any possible cpu id can be used here, so
|
|
|
|
* there's no need to worry about preemption or cpu hotplug.
|
|
|
|
*/
|
|
|
|
addr += pcpu_unit_offsets[raw_smp_processor_id()];
|
2010-04-09 17:57:01 +08:00
|
|
|
return pcpu_get_page_chunk(pcpu_addr_to_page(addr));
|
2010-04-09 17:57:01 +08:00
|
|
|
}
|
|
|
|
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
#ifdef CONFIG_MEMCG_KMEM
|
2021-06-03 09:09:31 +08:00
|
|
|
static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
|
|
|
|
struct obj_cgroup **objcgp)
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
{
|
|
|
|
struct obj_cgroup *objcg;
|
|
|
|
|
2020-10-18 07:13:44 +08:00
|
|
|
if (!memcg_kmem_enabled() || !(gfp & __GFP_ACCOUNT))
|
2021-06-03 09:09:31 +08:00
|
|
|
return true;
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
|
|
|
|
objcg = get_obj_cgroup_from_current();
|
|
|
|
if (!objcg)
|
2021-06-03 09:09:31 +08:00
|
|
|
return true;
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
|
|
|
|
if (obj_cgroup_charge(objcg, gfp, size * num_possible_cpus())) {
|
|
|
|
obj_cgroup_put(objcg);
|
2021-06-03 09:09:31 +08:00
|
|
|
return false;
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
*objcgp = objcg;
|
2021-06-03 09:09:31 +08:00
|
|
|
return true;
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
|
|
|
|
struct pcpu_chunk *chunk, int off,
|
|
|
|
size_t size)
|
|
|
|
{
|
|
|
|
if (!objcg)
|
|
|
|
return;
|
|
|
|
|
2021-06-03 09:09:31 +08:00
|
|
|
if (likely(chunk && chunk->obj_cgroups)) {
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg;
|
2020-08-12 09:30:21 +08:00
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
|
|
|
|
size * num_possible_cpus());
|
|
|
|
rcu_read_unlock();
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
} else {
|
|
|
|
obj_cgroup_uncharge(objcg, size * num_possible_cpus());
|
|
|
|
obj_cgroup_put(objcg);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
|
|
|
|
{
|
|
|
|
struct obj_cgroup *objcg;
|
|
|
|
|
2021-06-03 09:09:31 +08:00
|
|
|
if (unlikely(!chunk->obj_cgroups))
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
return;
|
|
|
|
|
|
|
|
objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
|
2021-06-03 09:09:31 +08:00
|
|
|
if (!objcg)
|
|
|
|
return;
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
|
|
|
|
|
|
|
|
obj_cgroup_uncharge(objcg, size * num_possible_cpus());
|
|
|
|
|
2020-08-12 09:30:21 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
|
|
|
|
-(size * num_possible_cpus()));
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
obj_cgroup_put(objcg);
|
|
|
|
}
|
|
|
|
|
|
|
|
#else /* CONFIG_MEMCG_KMEM */
|
2021-06-03 09:09:31 +08:00
|
|
|
static bool
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp, struct obj_cgroup **objcgp)
|
|
|
|
{
|
2021-06-03 09:09:31 +08:00
|
|
|
return true;
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
|
|
|
|
struct pcpu_chunk *chunk, int off,
|
|
|
|
size_t size)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_MEMCG_KMEM */
|
|
|
|
|
2009-02-20 15:29:08 +08:00
|
|
|
/**
|
2009-03-06 13:33:59 +08:00
|
|
|
* pcpu_alloc - the percpu allocator
|
2009-02-21 15:56:23 +08:00
|
|
|
* @size: size of area to allocate in bytes
|
2009-02-20 15:29:08 +08:00
|
|
|
* @align: alignment of area (max PAGE_SIZE)
|
2009-03-06 13:33:59 +08:00
|
|
|
* @reserved: allocate from the reserved chunk if available
|
2014-09-03 02:46:04 +08:00
|
|
|
* @gfp: allocation flags
|
2009-02-20 15:29:08 +08:00
|
|
|
*
|
2014-09-03 02:46:04 +08:00
|
|
|
* Allocate percpu area of @size bytes aligned at @align. If @gfp doesn't
|
mm, percpu: add support for __GFP_NOWARN flag
Add an option for pcpu_alloc() to support __GFP_NOWARN flag.
Currently, we always throw a warning when size or alignment
is unsupported (and also dump stack on failed allocation
requests). The warning itself is harmless since we return
NULL anyway for any failed request, which callers are
required to handle anyway. However, it becomes harmful when
panic_on_warn is set.
The rationale for the WARN() in pcpu_alloc() is that it can
be tracked when larger than supported allocation requests are
made such that allocations limits can be tweaked if warranted.
This makes sense for in-kernel users, however, there are users
of pcpu allocator where allocation size is derived from user
space requests, e.g. when creating BPF maps. In these cases,
the requests should fail gracefully without throwing a splat.
The current work-around was to check allocation size against
the upper limit of PCPU_MIN_UNIT_SIZE from call-sites for
bailing out prior to a call to pcpu_alloc() in order to
avoid throwing the WARN(). This is bad in multiple ways since
PCPU_MIN_UNIT_SIZE is an implementation detail, and having
the checks on call-sites only complicates the code for no
good reason. Thus, lets fix it generically by supporting the
__GFP_NOWARN flag that users can then use with calling the
__alloc_percpu_gfp() helper instead.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-17 22:55:52 +08:00
|
|
|
* contain %GFP_KERNEL, the allocation is atomic. If @gfp has __GFP_NOWARN
|
|
|
|
* then no warning will be triggered on invalid or failed allocation
|
|
|
|
* requests.
|
2009-02-20 15:29:08 +08:00
|
|
|
*
|
|
|
|
* RETURNS:
|
|
|
|
* Percpu pointer to the allocated area on success, NULL on failure.
|
|
|
|
*/
|
2014-09-03 02:46:04 +08:00
|
|
|
static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
|
|
|
|
gfp_t gfp)
|
2009-02-20 15:29:08 +08:00
|
|
|
{
|
percpu: make pcpu_alloc() aware of current gfp context
Since 5.7-rc1, on btrfs we have a percpu counter initialization for
which we always pass a GFP_KERNEL gfp_t argument (this happens since
commit 2992df73268f78 ("btrfs: Implement DREW lock")).
That is safe in some contextes but not on others where allowing fs
reclaim could lead to a deadlock because we are either holding some
btrfs lock needed for a transaction commit or holding a btrfs
transaction handle open. Because of that we surround the call to the
function that initializes the percpu counter with a NOFS context using
memalloc_nofs_save() (this is done at btrfs_init_fs_root()).
However it turns out that this is not enough to prevent a possible
deadlock because percpu_alloc() determines if it is in an atomic context
by looking exclusively at the gfp flags passed to it (GFP_KERNEL in this
case) and it is not aware that a NOFS context is set.
Because percpu_alloc() thinks it is in a non atomic context it locks the
pcpu_alloc_mutex. This can result in a btrfs deadlock when
pcpu_balance_workfn() is running, has acquired that mutex and is waiting
for reclaim, while the btrfs task that called percpu_counter_init() (and
therefore percpu_alloc()) is holding either the btrfs commit_root
semaphore or a transaction handle (done fs/btrfs/backref.c:
iterate_extent_inodes()), which prevents reclaim from finishing as an
attempt to commit the current btrfs transaction will deadlock.
Lockdep reports this issue with the following trace:
======================================================
WARNING: possible circular locking dependency detected
5.6.0-rc7-btrfs-next-77 #1 Not tainted
------------------------------------------------------
kswapd0/91 is trying to acquire lock:
ffff8938a3b3fdc8 (&delayed_node->mutex){+.+.}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
but task is already holding lock:
ffffffffb4f0dbc0 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (fs_reclaim){+.+.}:
fs_reclaim_acquire.part.0+0x25/0x30
__kmalloc+0x5f/0x3a0
pcpu_create_chunk+0x19/0x230
pcpu_balance_workfn+0x56a/0x680
process_one_work+0x235/0x5f0
worker_thread+0x50/0x3b0
kthread+0x120/0x140
ret_from_fork+0x3a/0x50
-> #3 (pcpu_alloc_mutex){+.+.}:
__mutex_lock+0xa9/0xaf0
pcpu_alloc+0x480/0x7c0
__percpu_counter_init+0x50/0xd0
btrfs_drew_lock_init+0x22/0x70 [btrfs]
btrfs_get_fs_root+0x29c/0x5c0 [btrfs]
resolve_indirect_refs+0x120/0xa30 [btrfs]
find_parent_nodes+0x50b/0xf30 [btrfs]
btrfs_find_all_leafs+0x60/0xb0 [btrfs]
iterate_extent_inodes+0x139/0x2f0 [btrfs]
iterate_inodes_from_logical+0xa1/0xe0 [btrfs]
btrfs_ioctl_logical_to_ino+0xb4/0x190 [btrfs]
btrfs_ioctl+0x165a/0x3130 [btrfs]
ksys_ioctl+0x87/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x5c/0x260
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #2 (&fs_info->commit_root_sem){++++}:
down_write+0x38/0x70
btrfs_cache_block_group+0x2ec/0x500 [btrfs]
find_free_extent+0xc6a/0x1600 [btrfs]
btrfs_reserve_extent+0x9b/0x180 [btrfs]
btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
__btrfs_cow_block+0x122/0x5a0 [btrfs]
btrfs_cow_block+0x106/0x240 [btrfs]
commit_cowonly_roots+0x55/0x310 [btrfs]
btrfs_commit_transaction+0x509/0xb20 [btrfs]
sync_filesystem+0x74/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x93/0xc0
exit_to_usermode_loop+0xf9/0x100
do_syscall_64+0x20d/0x260
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #1 (&space_info->groups_sem){++++}:
down_read+0x3c/0x140
find_free_extent+0xef6/0x1600 [btrfs]
btrfs_reserve_extent+0x9b/0x180 [btrfs]
btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
__btrfs_cow_block+0x122/0x5a0 [btrfs]
btrfs_cow_block+0x106/0x240 [btrfs]
btrfs_search_slot+0x50c/0xd60 [btrfs]
btrfs_lookup_inode+0x3a/0xc0 [btrfs]
__btrfs_update_delayed_inode+0x90/0x280 [btrfs]
__btrfs_commit_inode_delayed_items+0x81f/0x870 [btrfs]
__btrfs_run_delayed_items+0x8e/0x180 [btrfs]
btrfs_commit_transaction+0x31b/0xb20 [btrfs]
iterate_supers+0x87/0xf0
ksys_sync+0x60/0xb0
__ia32_sys_sync+0xa/0x10
do_syscall_64+0x5c/0x260
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #0 (&delayed_node->mutex){+.+.}:
__lock_acquire+0xef0/0x1c80
lock_acquire+0xa2/0x1d0
__mutex_lock+0xa9/0xaf0
__btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
btrfs_evict_inode+0x40d/0x560 [btrfs]
evict+0xd9/0x1c0
dispose_list+0x48/0x70
prune_icache_sb+0x54/0x80
super_cache_scan+0x124/0x1a0
do_shrink_slab+0x176/0x440
shrink_slab+0x23a/0x2c0
shrink_node+0x188/0x6e0
balance_pgdat+0x31d/0x7f0
kswapd+0x238/0x550
kthread+0x120/0x140
ret_from_fork+0x3a/0x50
other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> pcpu_alloc_mutex --> fs_reclaim
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(fs_reclaim);
lock(pcpu_alloc_mutex);
lock(fs_reclaim);
lock(&delayed_node->mutex);
*** DEADLOCK ***
3 locks held by kswapd0/91:
#0: (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
#1: (shrinker_rwsem){++++}, at: shrink_slab+0x12f/0x2c0
#2: (&type->s_umount_key#43){++++}, at: trylock_super+0x16/0x50
stack backtrace:
CPU: 1 PID: 91 Comm: kswapd0 Not tainted 5.6.0-rc7-btrfs-next-77 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8f/0xd0
check_noncircular+0x170/0x190
__lock_acquire+0xef0/0x1c80
lock_acquire+0xa2/0x1d0
__mutex_lock+0xa9/0xaf0
__btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
btrfs_evict_inode+0x40d/0x560 [btrfs]
evict+0xd9/0x1c0
dispose_list+0x48/0x70
prune_icache_sb+0x54/0x80
super_cache_scan+0x124/0x1a0
do_shrink_slab+0x176/0x440
shrink_slab+0x23a/0x2c0
shrink_node+0x188/0x6e0
balance_pgdat+0x31d/0x7f0
kswapd+0x238/0x550
kthread+0x120/0x140
ret_from_fork+0x3a/0x50
This could be fixed by making btrfs pass GFP_NOFS instead of GFP_KERNEL
to percpu_counter_init() in contextes where it is not reclaim safe,
however that type of approach is discouraged since
memalloc_[nofs|noio]_save() were introduced. Therefore this change
makes pcpu_alloc() look up into an existing nofs/noio context before
deciding whether it is in an atomic context or not.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Link: http://lkml.kernel.org/r/20200430164356.15543-1-fdmanana@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-08 09:36:10 +08:00
|
|
|
gfp_t pcpu_gfp;
|
|
|
|
bool is_atomic;
|
|
|
|
bool do_warn;
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
struct obj_cgroup *objcg = NULL;
|
2009-09-29 08:17:58 +08:00
|
|
|
static int warn_limit = 10;
|
2019-02-26 01:03:50 +08:00
|
|
|
struct pcpu_chunk *chunk, *next;
|
2009-09-29 08:17:58 +08:00
|
|
|
const char *err;
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
int slot, off, cpu, ret;
|
2009-10-28 23:25:59 +08:00
|
|
|
unsigned long flags;
|
2011-09-27 00:12:53 +08:00
|
|
|
void __percpu *ptr;
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
size_t bits, bit_align;
|
2009-02-20 15:29:08 +08:00
|
|
|
|
percpu: make pcpu_alloc() aware of current gfp context
Since 5.7-rc1, on btrfs we have a percpu counter initialization for
which we always pass a GFP_KERNEL gfp_t argument (this happens since
commit 2992df73268f78 ("btrfs: Implement DREW lock")).
That is safe in some contextes but not on others where allowing fs
reclaim could lead to a deadlock because we are either holding some
btrfs lock needed for a transaction commit or holding a btrfs
transaction handle open. Because of that we surround the call to the
function that initializes the percpu counter with a NOFS context using
memalloc_nofs_save() (this is done at btrfs_init_fs_root()).
However it turns out that this is not enough to prevent a possible
deadlock because percpu_alloc() determines if it is in an atomic context
by looking exclusively at the gfp flags passed to it (GFP_KERNEL in this
case) and it is not aware that a NOFS context is set.
Because percpu_alloc() thinks it is in a non atomic context it locks the
pcpu_alloc_mutex. This can result in a btrfs deadlock when
pcpu_balance_workfn() is running, has acquired that mutex and is waiting
for reclaim, while the btrfs task that called percpu_counter_init() (and
therefore percpu_alloc()) is holding either the btrfs commit_root
semaphore or a transaction handle (done fs/btrfs/backref.c:
iterate_extent_inodes()), which prevents reclaim from finishing as an
attempt to commit the current btrfs transaction will deadlock.
Lockdep reports this issue with the following trace:
======================================================
WARNING: possible circular locking dependency detected
5.6.0-rc7-btrfs-next-77 #1 Not tainted
------------------------------------------------------
kswapd0/91 is trying to acquire lock:
ffff8938a3b3fdc8 (&delayed_node->mutex){+.+.}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
but task is already holding lock:
ffffffffb4f0dbc0 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #4 (fs_reclaim){+.+.}:
fs_reclaim_acquire.part.0+0x25/0x30
__kmalloc+0x5f/0x3a0
pcpu_create_chunk+0x19/0x230
pcpu_balance_workfn+0x56a/0x680
process_one_work+0x235/0x5f0
worker_thread+0x50/0x3b0
kthread+0x120/0x140
ret_from_fork+0x3a/0x50
-> #3 (pcpu_alloc_mutex){+.+.}:
__mutex_lock+0xa9/0xaf0
pcpu_alloc+0x480/0x7c0
__percpu_counter_init+0x50/0xd0
btrfs_drew_lock_init+0x22/0x70 [btrfs]
btrfs_get_fs_root+0x29c/0x5c0 [btrfs]
resolve_indirect_refs+0x120/0xa30 [btrfs]
find_parent_nodes+0x50b/0xf30 [btrfs]
btrfs_find_all_leafs+0x60/0xb0 [btrfs]
iterate_extent_inodes+0x139/0x2f0 [btrfs]
iterate_inodes_from_logical+0xa1/0xe0 [btrfs]
btrfs_ioctl_logical_to_ino+0xb4/0x190 [btrfs]
btrfs_ioctl+0x165a/0x3130 [btrfs]
ksys_ioctl+0x87/0xc0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x5c/0x260
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #2 (&fs_info->commit_root_sem){++++}:
down_write+0x38/0x70
btrfs_cache_block_group+0x2ec/0x500 [btrfs]
find_free_extent+0xc6a/0x1600 [btrfs]
btrfs_reserve_extent+0x9b/0x180 [btrfs]
btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
__btrfs_cow_block+0x122/0x5a0 [btrfs]
btrfs_cow_block+0x106/0x240 [btrfs]
commit_cowonly_roots+0x55/0x310 [btrfs]
btrfs_commit_transaction+0x509/0xb20 [btrfs]
sync_filesystem+0x74/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x93/0xc0
exit_to_usermode_loop+0xf9/0x100
do_syscall_64+0x20d/0x260
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #1 (&space_info->groups_sem){++++}:
down_read+0x3c/0x140
find_free_extent+0xef6/0x1600 [btrfs]
btrfs_reserve_extent+0x9b/0x180 [btrfs]
btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
__btrfs_cow_block+0x122/0x5a0 [btrfs]
btrfs_cow_block+0x106/0x240 [btrfs]
btrfs_search_slot+0x50c/0xd60 [btrfs]
btrfs_lookup_inode+0x3a/0xc0 [btrfs]
__btrfs_update_delayed_inode+0x90/0x280 [btrfs]
__btrfs_commit_inode_delayed_items+0x81f/0x870 [btrfs]
__btrfs_run_delayed_items+0x8e/0x180 [btrfs]
btrfs_commit_transaction+0x31b/0xb20 [btrfs]
iterate_supers+0x87/0xf0
ksys_sync+0x60/0xb0
__ia32_sys_sync+0xa/0x10
do_syscall_64+0x5c/0x260
entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #0 (&delayed_node->mutex){+.+.}:
__lock_acquire+0xef0/0x1c80
lock_acquire+0xa2/0x1d0
__mutex_lock+0xa9/0xaf0
__btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
btrfs_evict_inode+0x40d/0x560 [btrfs]
evict+0xd9/0x1c0
dispose_list+0x48/0x70
prune_icache_sb+0x54/0x80
super_cache_scan+0x124/0x1a0
do_shrink_slab+0x176/0x440
shrink_slab+0x23a/0x2c0
shrink_node+0x188/0x6e0
balance_pgdat+0x31d/0x7f0
kswapd+0x238/0x550
kthread+0x120/0x140
ret_from_fork+0x3a/0x50
other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> pcpu_alloc_mutex --> fs_reclaim
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(fs_reclaim);
lock(pcpu_alloc_mutex);
lock(fs_reclaim);
lock(&delayed_node->mutex);
*** DEADLOCK ***
3 locks held by kswapd0/91:
#0: (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
#1: (shrinker_rwsem){++++}, at: shrink_slab+0x12f/0x2c0
#2: (&type->s_umount_key#43){++++}, at: trylock_super+0x16/0x50
stack backtrace:
CPU: 1 PID: 91 Comm: kswapd0 Not tainted 5.6.0-rc7-btrfs-next-77 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8f/0xd0
check_noncircular+0x170/0x190
__lock_acquire+0xef0/0x1c80
lock_acquire+0xa2/0x1d0
__mutex_lock+0xa9/0xaf0
__btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
btrfs_evict_inode+0x40d/0x560 [btrfs]
evict+0xd9/0x1c0
dispose_list+0x48/0x70
prune_icache_sb+0x54/0x80
super_cache_scan+0x124/0x1a0
do_shrink_slab+0x176/0x440
shrink_slab+0x23a/0x2c0
shrink_node+0x188/0x6e0
balance_pgdat+0x31d/0x7f0
kswapd+0x238/0x550
kthread+0x120/0x140
ret_from_fork+0x3a/0x50
This could be fixed by making btrfs pass GFP_NOFS instead of GFP_KERNEL
to percpu_counter_init() in contextes where it is not reclaim safe,
however that type of approach is discouraged since
memalloc_[nofs|noio]_save() were introduced. Therefore this change
makes pcpu_alloc() look up into an existing nofs/noio context before
deciding whether it is in an atomic context or not.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Link: http://lkml.kernel.org/r/20200430164356.15543-1-fdmanana@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-08 09:36:10 +08:00
|
|
|
gfp = current_gfp_context(gfp);
|
|
|
|
/* whitelisted flags that can be passed to the backing allocators */
|
|
|
|
pcpu_gfp = gfp & (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
|
|
|
|
is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL;
|
|
|
|
do_warn = !(gfp & __GFP_NOWARN);
|
|
|
|
|
percpu: store offsets instead of lengths in ->map[]
Current code keeps +-length for each area in chunk->map[]. It has
several unpleasant consequences:
* even if we know that first 50 areas are all in use, allocation
still needs to go through all those areas just to sum their sizes, just
to get the offset of free one.
* freeing needs to find the array entry refering to the area
in question; again, the need to sum the sizes until we reach the offset
we are interested in. Note that offsets are monotonous, so simple
binary search would do here.
New data representation: array of <offset,in-use flag> pairs.
Each pair is represented by one int - we use offset|1 for <offset, in use>
and offset for <offset, free> (we make sure that all offsets are even).
In the end we put a sentry entry - <total size, in use>. The first
entry is <0, flag>; it would be possible to store together the flag
for Nth area and offset for N+1st, but that leads to much hairier code.
In other words, where the old variant would have
4, -8, -4, 4, -12, 100
(4 bytes free, 8 in use, 4 in use, 4 free, 12 in use, 100 free) we store
<0,0>, <4,1>, <12,1>, <16,0>, <20,1>, <32,0>, <132,1>
i.e.
0, 5, 13, 16, 21, 32, 133
This commit switches to new data representation and takes care of a couple
of low-hanging fruits in free_pcpu_area() - one is the switch to binary
search, another is not doing two memmove() when one would do. Speeding
the alloc side up (by keeping track of how many areas in the beginning are
known to be all in use) also becomes possible - that'll be done in the next
commit.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-03-07 10:13:18 +08:00
|
|
|
/*
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
* There is now a minimum allocation size of PCPU_MIN_ALLOC_SIZE,
|
|
|
|
* therefore alignment must be a minimum of that many bytes.
|
|
|
|
* An allocation may have internal fragmentation from rounding up
|
|
|
|
* of up to PCPU_MIN_ALLOC_SIZE - 1 bytes.
|
percpu: store offsets instead of lengths in ->map[]
Current code keeps +-length for each area in chunk->map[]. It has
several unpleasant consequences:
* even if we know that first 50 areas are all in use, allocation
still needs to go through all those areas just to sum their sizes, just
to get the offset of free one.
* freeing needs to find the array entry refering to the area
in question; again, the need to sum the sizes until we reach the offset
we are interested in. Note that offsets are monotonous, so simple
binary search would do here.
New data representation: array of <offset,in-use flag> pairs.
Each pair is represented by one int - we use offset|1 for <offset, in use>
and offset for <offset, free> (we make sure that all offsets are even).
In the end we put a sentry entry - <total size, in use>. The first
entry is <0, flag>; it would be possible to store together the flag
for Nth area and offset for N+1st, but that leads to much hairier code.
In other words, where the old variant would have
4, -8, -4, 4, -12, 100
(4 bytes free, 8 in use, 4 in use, 4 free, 12 in use, 100 free) we store
<0,0>, <4,1>, <12,1>, <16,0>, <20,1>, <32,0>, <132,1>
i.e.
0, 5, 13, 16, 21, 32, 133
This commit switches to new data representation and takes care of a couple
of low-hanging fruits in free_pcpu_area() - one is the switch to binary
search, another is not doing two memmove() when one would do. Speeding
the alloc side up (by keeping track of how many areas in the beginning are
known to be all in use) also becomes possible - that'll be done in the next
commit.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-03-07 10:13:18 +08:00
|
|
|
*/
|
2017-07-25 07:02:09 +08:00
|
|
|
if (unlikely(align < PCPU_MIN_ALLOC_SIZE))
|
|
|
|
align = PCPU_MIN_ALLOC_SIZE;
|
percpu: store offsets instead of lengths in ->map[]
Current code keeps +-length for each area in chunk->map[]. It has
several unpleasant consequences:
* even if we know that first 50 areas are all in use, allocation
still needs to go through all those areas just to sum their sizes, just
to get the offset of free one.
* freeing needs to find the array entry refering to the area
in question; again, the need to sum the sizes until we reach the offset
we are interested in. Note that offsets are monotonous, so simple
binary search would do here.
New data representation: array of <offset,in-use flag> pairs.
Each pair is represented by one int - we use offset|1 for <offset, in use>
and offset for <offset, free> (we make sure that all offsets are even).
In the end we put a sentry entry - <total size, in use>. The first
entry is <0, flag>; it would be possible to store together the flag
for Nth area and offset for N+1st, but that leads to much hairier code.
In other words, where the old variant would have
4, -8, -4, 4, -12, 100
(4 bytes free, 8 in use, 4 in use, 4 free, 12 in use, 100 free) we store
<0,0>, <4,1>, <12,1>, <16,0>, <20,1>, <32,0>, <132,1>
i.e.
0, 5, 13, 16, 21, 32, 133
This commit switches to new data representation and takes care of a couple
of low-hanging fruits in free_pcpu_area() - one is the switch to binary
search, another is not doing two memmove() when one would do. Speeding
the alloc side up (by keeping track of how many areas in the beginning are
known to be all in use) also becomes possible - that'll be done in the next
commit.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-03-07 10:13:18 +08:00
|
|
|
|
2017-07-25 07:02:09 +08:00
|
|
|
size = ALIGN(size, PCPU_MIN_ALLOC_SIZE);
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
bits = size >> PCPU_MIN_ALLOC_SHIFT;
|
|
|
|
bit_align = align >> PCPU_MIN_ALLOC_SHIFT;
|
2014-03-18 04:01:27 +08:00
|
|
|
|
percpu: ensure the requested alignment is power of two
The percpu allocator expectedly assumes that the requested alignment
is power of two but hasn't been veryfing the input. If the specified
alignment isn't power of two, the allocator can malfunction. Add the
sanity check.
The following is detailed analysis of the effects of alignments which
aren't power of two.
The alignment must be a even at least since the LSB of a chunk->map
element is used as free/in-use flag of a area; besides, the alignment
must be a power of 2 too since ALIGN() doesn't work well for other
alignment always but is adopted by pcpu_fit_in_area(). IOW, the
current allocator only works well for a power of 2 aligned area
allocation.
See below opposite example for why an odd alignment doesn't work.
Let's assume area [16, 36) is free but its previous one is in-use, we
want to allocate a @size == 8 and @align == 7 area. The larger area
[16, 36) is split to three areas [16, 21), [21, 29), [29, 36)
eventually. However, due to the usage for a chunk->map element, the
actual offset of the aim area [21, 29) is 21 but is recorded in
relevant element as 20; moreover, the residual tail free area [29,
36) is mistook as in-use and is lost silently
Unlike macro roundup(), ALIGN(x, a) doesn't work if @a isn't a power
of 2 for example, roundup(10, 6) == 12 but ALIGN(10, 6) == 10, and
the latter result isn't desired obviously.
tj: Code style and patch description updates.
Signed-off-by: zijun_hu <zijun_hu@htc.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-10-14 15:12:54 +08:00
|
|
|
if (unlikely(!size || size > PCPU_MIN_UNIT_SIZE || align > PAGE_SIZE ||
|
|
|
|
!is_power_of_2(align))) {
|
mm, percpu: add support for __GFP_NOWARN flag
Add an option for pcpu_alloc() to support __GFP_NOWARN flag.
Currently, we always throw a warning when size or alignment
is unsupported (and also dump stack on failed allocation
requests). The warning itself is harmless since we return
NULL anyway for any failed request, which callers are
required to handle anyway. However, it becomes harmful when
panic_on_warn is set.
The rationale for the WARN() in pcpu_alloc() is that it can
be tracked when larger than supported allocation requests are
made such that allocations limits can be tweaked if warranted.
This makes sense for in-kernel users, however, there are users
of pcpu allocator where allocation size is derived from user
space requests, e.g. when creating BPF maps. In these cases,
the requests should fail gracefully without throwing a splat.
The current work-around was to check allocation size against
the upper limit of PCPU_MIN_UNIT_SIZE from call-sites for
bailing out prior to a call to pcpu_alloc() in order to
avoid throwing the WARN(). This is bad in multiple ways since
PCPU_MIN_UNIT_SIZE is an implementation detail, and having
the checks on call-sites only complicates the code for no
good reason. Thus, lets fix it generically by supporting the
__GFP_NOWARN flag that users can then use with calling the
__alloc_percpu_gfp() helper instead.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-17 22:55:52 +08:00
|
|
|
WARN(do_warn, "illegal size (%zu) or align (%zu) for percpu allocation\n",
|
2016-03-18 05:19:47 +08:00
|
|
|
size, align);
|
2009-02-20 15:29:08 +08:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2021-06-03 09:09:31 +08:00
|
|
|
if (unlikely(!pcpu_memcg_pre_alloc_hook(size, gfp, &objcg)))
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
return NULL;
|
|
|
|
|
2018-03-19 23:32:10 +08:00
|
|
|
if (!is_atomic) {
|
|
|
|
/*
|
|
|
|
* pcpu_balance_workfn() allocates memory under this mutex,
|
|
|
|
* and it may wait for memory reclaim. Allow current task
|
|
|
|
* to become OOM victim, in case of memory pressure.
|
|
|
|
*/
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
if (gfp & __GFP_NOFAIL) {
|
2018-03-19 23:32:10 +08:00
|
|
|
mutex_lock(&pcpu_alloc_mutex);
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
} else if (mutex_lock_killable(&pcpu_alloc_mutex)) {
|
|
|
|
pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size);
|
2018-03-19 23:32:10 +08:00
|
|
|
return NULL;
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
}
|
2018-03-19 23:32:10 +08:00
|
|
|
}
|
2016-05-25 23:48:25 +08:00
|
|
|
|
2009-10-28 23:25:59 +08:00
|
|
|
spin_lock_irqsave(&pcpu_lock, flags);
|
2009-02-20 15:29:08 +08:00
|
|
|
|
2009-03-06 13:33:59 +08:00
|
|
|
/* serve reserved allocations from the reserved chunk if available */
|
|
|
|
if (reserved && pcpu_reserved_chunk) {
|
|
|
|
chunk = pcpu_reserved_chunk;
|
2009-11-11 14:35:18 +08:00
|
|
|
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
off = pcpu_find_block_fit(chunk, bits, bit_align, is_atomic);
|
|
|
|
if (off < 0) {
|
2009-11-11 14:35:18 +08:00
|
|
|
err = "alloc from reserved chunk failed";
|
2009-03-06 23:44:13 +08:00
|
|
|
goto fail_unlock;
|
2009-09-29 08:17:58 +08:00
|
|
|
}
|
2009-11-11 14:35:18 +08:00
|
|
|
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
off = pcpu_alloc_area(chunk, bits, bit_align, off);
|
2009-03-06 13:33:59 +08:00
|
|
|
if (off >= 0)
|
|
|
|
goto area_found;
|
2009-11-11 14:35:18 +08:00
|
|
|
|
2009-09-29 08:17:58 +08:00
|
|
|
err = "alloc from reserved chunk failed";
|
2009-03-06 23:44:13 +08:00
|
|
|
goto fail_unlock;
|
2009-03-06 13:33:59 +08:00
|
|
|
}
|
|
|
|
|
2009-03-06 23:44:13 +08:00
|
|
|
restart:
|
2009-03-06 13:33:59 +08:00
|
|
|
/* search through normal chunks */
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
for (slot = pcpu_size_to_slot(size); slot <= pcpu_free_slot; slot++) {
|
2021-06-03 09:09:31 +08:00
|
|
|
list_for_each_entry_safe(chunk, next, &pcpu_chunk_lists[slot],
|
|
|
|
list) {
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
off = pcpu_find_block_fit(chunk, bits, bit_align,
|
|
|
|
is_atomic);
|
2019-02-26 01:03:50 +08:00
|
|
|
if (off < 0) {
|
|
|
|
if (slot < PCPU_SLOT_FAIL_THRESHOLD)
|
|
|
|
pcpu_chunk_move(chunk, 0);
|
2009-02-20 15:29:08 +08:00
|
|
|
continue;
|
2019-02-26 01:03:50 +08:00
|
|
|
}
|
2009-03-06 23:44:13 +08:00
|
|
|
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
off = pcpu_alloc_area(chunk, bits, bit_align, off);
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
if (off >= 0) {
|
|
|
|
pcpu_reintegrate_chunk(chunk);
|
2009-02-20 15:29:08 +08:00
|
|
|
goto area_found;
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
}
|
2009-02-20 15:29:08 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2009-10-28 23:25:59 +08:00
|
|
|
spin_unlock_irqrestore(&pcpu_lock, flags);
|
2009-03-06 23:44:13 +08:00
|
|
|
|
2014-09-03 02:46:02 +08:00
|
|
|
/*
|
|
|
|
* No space left. Create a new chunk. We don't want multiple
|
|
|
|
* tasks to create chunks simultaneously. Serialize and create iff
|
|
|
|
* there's still no empty chunk after grabbing the mutex.
|
|
|
|
*/
|
2017-06-21 23:51:09 +08:00
|
|
|
if (is_atomic) {
|
|
|
|
err = "atomic alloc failed, no space left";
|
2014-09-03 02:46:04 +08:00
|
|
|
goto fail;
|
2017-06-21 23:51:09 +08:00
|
|
|
}
|
2014-09-03 02:46:04 +08:00
|
|
|
|
2021-06-03 09:09:31 +08:00
|
|
|
if (list_empty(&pcpu_chunk_lists[pcpu_free_slot])) {
|
|
|
|
chunk = pcpu_create_chunk(pcpu_gfp);
|
2014-09-03 02:46:02 +08:00
|
|
|
if (!chunk) {
|
|
|
|
err = "failed to allocate new chunk";
|
|
|
|
goto fail;
|
|
|
|
}
|
|
|
|
|
|
|
|
spin_lock_irqsave(&pcpu_lock, flags);
|
|
|
|
pcpu_chunk_relocate(chunk, -1);
|
|
|
|
} else {
|
|
|
|
spin_lock_irqsave(&pcpu_lock, flags);
|
2009-09-29 08:17:58 +08:00
|
|
|
}
|
2009-03-06 23:44:13 +08:00
|
|
|
|
|
|
|
goto restart;
|
2009-02-20 15:29:08 +08:00
|
|
|
|
|
|
|
area_found:
|
2017-06-20 07:28:31 +08:00
|
|
|
pcpu_stats_area_alloc(chunk, size);
|
2009-10-28 23:25:59 +08:00
|
|
|
spin_unlock_irqrestore(&pcpu_lock, flags);
|
2009-03-06 23:44:13 +08:00
|
|
|
|
2014-09-03 02:46:01 +08:00
|
|
|
/* populate if not all pages are already there */
|
2014-09-03 02:46:04 +08:00
|
|
|
if (!is_atomic) {
|
2019-12-14 08:22:10 +08:00
|
|
|
unsigned int page_start, page_end, rs, re;
|
2014-09-03 02:46:01 +08:00
|
|
|
|
2014-09-03 02:46:04 +08:00
|
|
|
page_start = PFN_DOWN(off);
|
|
|
|
page_end = PFN_UP(off + size);
|
2014-09-03 02:46:02 +08:00
|
|
|
|
2019-12-14 08:22:10 +08:00
|
|
|
bitmap_for_each_clear_region(chunk->populated, rs, re,
|
|
|
|
page_start, page_end) {
|
2014-09-03 02:46:04 +08:00
|
|
|
WARN_ON(chunk->immutable);
|
|
|
|
|
2018-02-17 02:09:58 +08:00
|
|
|
ret = pcpu_populate_chunk(chunk, rs, re, pcpu_gfp);
|
2014-09-03 02:46:04 +08:00
|
|
|
|
|
|
|
spin_lock_irqsave(&pcpu_lock, flags);
|
|
|
|
if (ret) {
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
pcpu_free_area(chunk, off);
|
2014-09-03 02:46:04 +08:00
|
|
|
err = "failed to populate";
|
|
|
|
goto fail_unlock;
|
|
|
|
}
|
2019-02-14 03:10:30 +08:00
|
|
|
pcpu_chunk_populated(chunk, rs, re);
|
2014-09-03 02:46:04 +08:00
|
|
|
spin_unlock_irqrestore(&pcpu_lock, flags);
|
2014-09-03 02:46:01 +08:00
|
|
|
}
|
2009-02-20 15:29:08 +08:00
|
|
|
|
2014-09-03 02:46:04 +08:00
|
|
|
mutex_unlock(&pcpu_alloc_mutex);
|
|
|
|
}
|
2009-03-06 23:44:13 +08:00
|
|
|
|
2021-06-03 09:09:31 +08:00
|
|
|
if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_LOW)
|
2014-09-03 02:46:05 +08:00
|
|
|
pcpu_schedule_balance_work();
|
|
|
|
|
2014-09-03 02:46:01 +08:00
|
|
|
/* clear the areas and return address relative to base address */
|
|
|
|
for_each_possible_cpu(cpu)
|
|
|
|
memset((void *)pcpu_chunk_addr(chunk, cpu, 0) + off, 0, size);
|
|
|
|
|
2011-09-27 00:12:53 +08:00
|
|
|
ptr = __addr_to_pcpu_ptr(chunk->base_addr + off);
|
2015-06-25 07:58:51 +08:00
|
|
|
kmemleak_alloc_percpu(ptr, size, gfp);
|
2017-06-20 07:28:32 +08:00
|
|
|
|
|
|
|
trace_percpu_alloc_percpu(reserved, is_atomic, size, align,
|
|
|
|
chunk->base_addr, off, ptr);
|
|
|
|
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
pcpu_memcg_post_alloc_hook(objcg, chunk, off, size);
|
|
|
|
|
2011-09-27 00:12:53 +08:00
|
|
|
return ptr;
|
2009-03-06 23:44:13 +08:00
|
|
|
|
|
|
|
fail_unlock:
|
2009-10-28 23:25:59 +08:00
|
|
|
spin_unlock_irqrestore(&pcpu_lock, flags);
|
2014-09-03 02:46:02 +08:00
|
|
|
fail:
|
2017-06-20 07:28:32 +08:00
|
|
|
trace_percpu_alloc_percpu_fail(reserved, is_atomic, size, align);
|
|
|
|
|
mm, percpu: add support for __GFP_NOWARN flag
Add an option for pcpu_alloc() to support __GFP_NOWARN flag.
Currently, we always throw a warning when size or alignment
is unsupported (and also dump stack on failed allocation
requests). The warning itself is harmless since we return
NULL anyway for any failed request, which callers are
required to handle anyway. However, it becomes harmful when
panic_on_warn is set.
The rationale for the WARN() in pcpu_alloc() is that it can
be tracked when larger than supported allocation requests are
made such that allocations limits can be tweaked if warranted.
This makes sense for in-kernel users, however, there are users
of pcpu allocator where allocation size is derived from user
space requests, e.g. when creating BPF maps. In these cases,
the requests should fail gracefully without throwing a splat.
The current work-around was to check allocation size against
the upper limit of PCPU_MIN_UNIT_SIZE from call-sites for
bailing out prior to a call to pcpu_alloc() in order to
avoid throwing the WARN(). This is bad in multiple ways since
PCPU_MIN_UNIT_SIZE is an implementation detail, and having
the checks on call-sites only complicates the code for no
good reason. Thus, lets fix it generically by supporting the
__GFP_NOWARN flag that users can then use with calling the
__alloc_percpu_gfp() helper instead.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-17 22:55:52 +08:00
|
|
|
if (!is_atomic && do_warn && warn_limit) {
|
2016-03-18 05:19:53 +08:00
|
|
|
pr_warn("allocation failed, size=%zu align=%zu atomic=%d, %s\n",
|
2016-03-18 05:19:44 +08:00
|
|
|
size, align, is_atomic, err);
|
2009-09-29 08:17:58 +08:00
|
|
|
dump_stack();
|
|
|
|
if (!--warn_limit)
|
2016-03-18 05:19:53 +08:00
|
|
|
pr_info("limit reached, disable warning\n");
|
2009-09-29 08:17:58 +08:00
|
|
|
}
|
2014-09-03 02:46:05 +08:00
|
|
|
if (is_atomic) {
|
2021-05-07 09:06:47 +08:00
|
|
|
/* see the flag handling in pcpu_balance_workfn() */
|
2014-09-03 02:46:05 +08:00
|
|
|
pcpu_atomic_alloc_failed = true;
|
|
|
|
pcpu_schedule_balance_work();
|
2016-05-25 23:48:25 +08:00
|
|
|
} else {
|
|
|
|
mutex_unlock(&pcpu_alloc_mutex);
|
2014-09-03 02:46:05 +08:00
|
|
|
}
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
|
|
|
|
pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size);
|
|
|
|
|
2009-03-06 23:44:13 +08:00
|
|
|
return NULL;
|
2009-02-20 15:29:08 +08:00
|
|
|
}
|
2009-03-06 13:33:59 +08:00
|
|
|
|
|
|
|
/**
|
2014-09-03 02:46:04 +08:00
|
|
|
* __alloc_percpu_gfp - allocate dynamic percpu area
|
2009-03-06 13:33:59 +08:00
|
|
|
* @size: size of area to allocate in bytes
|
|
|
|
* @align: alignment of area (max PAGE_SIZE)
|
2014-09-03 02:46:04 +08:00
|
|
|
* @gfp: allocation flags
|
2009-03-06 13:33:59 +08:00
|
|
|
*
|
2014-09-03 02:46:04 +08:00
|
|
|
* Allocate zero-filled percpu area of @size bytes aligned at @align. If
|
|
|
|
* @gfp doesn't contain %GFP_KERNEL, the allocation doesn't block and can
|
mm, percpu: add support for __GFP_NOWARN flag
Add an option for pcpu_alloc() to support __GFP_NOWARN flag.
Currently, we always throw a warning when size or alignment
is unsupported (and also dump stack on failed allocation
requests). The warning itself is harmless since we return
NULL anyway for any failed request, which callers are
required to handle anyway. However, it becomes harmful when
panic_on_warn is set.
The rationale for the WARN() in pcpu_alloc() is that it can
be tracked when larger than supported allocation requests are
made such that allocations limits can be tweaked if warranted.
This makes sense for in-kernel users, however, there are users
of pcpu allocator where allocation size is derived from user
space requests, e.g. when creating BPF maps. In these cases,
the requests should fail gracefully without throwing a splat.
The current work-around was to check allocation size against
the upper limit of PCPU_MIN_UNIT_SIZE from call-sites for
bailing out prior to a call to pcpu_alloc() in order to
avoid throwing the WARN(). This is bad in multiple ways since
PCPU_MIN_UNIT_SIZE is an implementation detail, and having
the checks on call-sites only complicates the code for no
good reason. Thus, lets fix it generically by supporting the
__GFP_NOWARN flag that users can then use with calling the
__alloc_percpu_gfp() helper instead.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-17 22:55:52 +08:00
|
|
|
* be called from any context but is a lot more likely to fail. If @gfp
|
|
|
|
* has __GFP_NOWARN then no warning will be triggered on invalid or failed
|
|
|
|
* allocation requests.
|
2009-03-06 23:44:13 +08:00
|
|
|
*
|
2009-03-06 13:33:59 +08:00
|
|
|
* RETURNS:
|
|
|
|
* Percpu pointer to the allocated area on success, NULL on failure.
|
|
|
|
*/
|
2014-09-03 02:46:04 +08:00
|
|
|
void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp)
|
|
|
|
{
|
|
|
|
return pcpu_alloc(size, align, false, gfp);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(__alloc_percpu_gfp);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* __alloc_percpu - allocate dynamic percpu area
|
|
|
|
* @size: size of area to allocate in bytes
|
|
|
|
* @align: alignment of area (max PAGE_SIZE)
|
|
|
|
*
|
|
|
|
* Equivalent to __alloc_percpu_gfp(size, align, %GFP_KERNEL).
|
|
|
|
*/
|
2010-02-02 13:38:57 +08:00
|
|
|
void __percpu *__alloc_percpu(size_t size, size_t align)
|
2009-03-06 13:33:59 +08:00
|
|
|
{
|
2014-09-03 02:46:04 +08:00
|
|
|
return pcpu_alloc(size, align, false, GFP_KERNEL);
|
2009-03-06 13:33:59 +08:00
|
|
|
}
|
2009-02-20 15:29:08 +08:00
|
|
|
EXPORT_SYMBOL_GPL(__alloc_percpu);
|
|
|
|
|
2009-03-06 13:33:59 +08:00
|
|
|
/**
|
|
|
|
* __alloc_reserved_percpu - allocate reserved percpu area
|
|
|
|
* @size: size of area to allocate in bytes
|
|
|
|
* @align: alignment of area (max PAGE_SIZE)
|
|
|
|
*
|
2010-09-10 17:01:56 +08:00
|
|
|
* Allocate zero-filled percpu area of @size bytes aligned at @align
|
|
|
|
* from reserved percpu area if arch has set it up; otherwise,
|
|
|
|
* allocation is served from the same dynamic area. Might sleep.
|
|
|
|
* Might trigger writeouts.
|
2009-03-06 13:33:59 +08:00
|
|
|
*
|
2009-03-06 23:44:13 +08:00
|
|
|
* CONTEXT:
|
|
|
|
* Does GFP_KERNEL allocation.
|
|
|
|
*
|
2009-03-06 13:33:59 +08:00
|
|
|
* RETURNS:
|
|
|
|
* Percpu pointer to the allocated area on success, NULL on failure.
|
|
|
|
*/
|
2010-02-02 13:38:57 +08:00
|
|
|
void __percpu *__alloc_reserved_percpu(size_t size, size_t align)
|
2009-03-06 13:33:59 +08:00
|
|
|
{
|
2014-09-03 02:46:04 +08:00
|
|
|
return pcpu_alloc(size, align, true, GFP_KERNEL);
|
2009-03-06 13:33:59 +08:00
|
|
|
}
|
|
|
|
|
2009-03-06 23:44:11 +08:00
|
|
|
/**
|
2021-04-08 11:57:32 +08:00
|
|
|
* pcpu_balance_free - manage the amount of free chunks
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
* @empty_only: free chunks only if there are no populated pages
|
2009-03-06 23:44:11 +08:00
|
|
|
*
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
* If empty_only is %false, reclaim all fully free chunks regardless of the
|
|
|
|
* number of populated pages. Otherwise, only reclaim chunks that have no
|
|
|
|
* populated pages.
|
2021-06-18 03:03:22 +08:00
|
|
|
*
|
|
|
|
* CONTEXT:
|
|
|
|
* pcpu_lock (can be dropped temporarily)
|
2009-03-06 23:44:11 +08:00
|
|
|
*/
|
2021-06-03 09:09:31 +08:00
|
|
|
static void pcpu_balance_free(bool empty_only)
|
2009-02-20 15:29:08 +08:00
|
|
|
{
|
2014-09-03 02:46:05 +08:00
|
|
|
LIST_HEAD(to_free);
|
2021-06-03 09:09:31 +08:00
|
|
|
struct list_head *free_head = &pcpu_chunk_lists[pcpu_free_slot];
|
2009-03-06 23:44:11 +08:00
|
|
|
struct pcpu_chunk *chunk, *next;
|
|
|
|
|
2021-06-18 03:03:22 +08:00
|
|
|
lockdep_assert_held(&pcpu_lock);
|
2009-03-06 23:44:11 +08:00
|
|
|
|
2014-09-03 02:46:05 +08:00
|
|
|
/*
|
|
|
|
* There's no reason to keep around multiple unused chunks and VM
|
|
|
|
* areas can be scarce. Destroy all free chunks except for one.
|
|
|
|
*/
|
2014-09-03 02:46:05 +08:00
|
|
|
list_for_each_entry_safe(chunk, next, free_head, list) {
|
2009-03-06 23:44:11 +08:00
|
|
|
WARN_ON(chunk->immutable);
|
|
|
|
|
|
|
|
/* spare the first one */
|
2014-09-03 02:46:05 +08:00
|
|
|
if (chunk == list_first_entry(free_head, struct pcpu_chunk, list))
|
2009-03-06 23:44:11 +08:00
|
|
|
continue;
|
|
|
|
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
if (!empty_only || chunk->nr_empty_pop_pages == 0)
|
|
|
|
list_move(&chunk->list, &to_free);
|
2009-03-06 23:44:11 +08:00
|
|
|
}
|
|
|
|
|
2021-06-18 03:03:22 +08:00
|
|
|
if (list_empty(&to_free))
|
|
|
|
return;
|
2009-03-06 23:44:11 +08:00
|
|
|
|
2021-06-18 03:03:22 +08:00
|
|
|
spin_unlock_irq(&pcpu_lock);
|
2014-09-03 02:46:05 +08:00
|
|
|
list_for_each_entry_safe(chunk, next, &to_free, list) {
|
2019-12-14 08:22:10 +08:00
|
|
|
unsigned int rs, re;
|
2014-09-03 02:46:01 +08:00
|
|
|
|
2019-12-14 08:22:10 +08:00
|
|
|
bitmap_for_each_set_region(chunk->populated, rs, re, 0,
|
|
|
|
chunk->nr_pages) {
|
2014-09-03 02:46:02 +08:00
|
|
|
pcpu_depopulate_chunk(chunk, rs, re);
|
2014-09-03 02:46:05 +08:00
|
|
|
spin_lock_irq(&pcpu_lock);
|
|
|
|
pcpu_chunk_depopulated(chunk, rs, re);
|
|
|
|
spin_unlock_irq(&pcpu_lock);
|
2014-09-03 02:46:02 +08:00
|
|
|
}
|
2010-04-09 17:57:01 +08:00
|
|
|
pcpu_destroy_chunk(chunk);
|
2018-02-24 00:12:42 +08:00
|
|
|
cond_resched();
|
2009-03-06 23:44:11 +08:00
|
|
|
}
|
2021-06-18 03:03:22 +08:00
|
|
|
spin_lock_irq(&pcpu_lock);
|
2021-04-08 11:57:32 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* pcpu_balance_populated - manage the amount of populated pages
|
|
|
|
*
|
|
|
|
* Maintain a certain amount of populated pages to satisfy atomic allocations.
|
|
|
|
* It is possible that this is called when physical memory is scarce causing
|
|
|
|
* OOM killer to be triggered. We should avoid doing so until an actual
|
|
|
|
* allocation causes the failure as it is possible that requests can be
|
|
|
|
* serviced from already backed regions.
|
2021-06-18 03:03:22 +08:00
|
|
|
*
|
|
|
|
* CONTEXT:
|
|
|
|
* pcpu_lock (can be dropped temporarily)
|
2021-04-08 11:57:32 +08:00
|
|
|
*/
|
2021-06-03 09:09:31 +08:00
|
|
|
static void pcpu_balance_populated(void)
|
2021-04-08 11:57:32 +08:00
|
|
|
{
|
|
|
|
/* gfp flags passed to underlying allocators */
|
|
|
|
const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
|
|
|
|
struct pcpu_chunk *chunk;
|
|
|
|
int slot, nr_to_pop, ret;
|
2009-08-14 14:00:49 +08:00
|
|
|
|
2021-06-18 03:03:22 +08:00
|
|
|
lockdep_assert_held(&pcpu_lock);
|
2009-08-14 14:00:49 +08:00
|
|
|
|
2014-09-03 02:46:05 +08:00
|
|
|
/*
|
|
|
|
* Ensure there are certain number of free populated pages for
|
|
|
|
* atomic allocs. Fill up from the most packed so that atomic
|
|
|
|
* allocs don't increase fragmentation. If atomic allocation
|
|
|
|
* failed previously, always populate the maximum amount. This
|
|
|
|
* should prevent atomic allocs larger than PAGE_SIZE from keeping
|
|
|
|
* failing indefinitely; however, large atomic allocs are not
|
|
|
|
* something we support properly and can be highly unreliable and
|
|
|
|
* inefficient.
|
|
|
|
*/
|
|
|
|
retry_pop:
|
|
|
|
if (pcpu_atomic_alloc_failed) {
|
|
|
|
nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH;
|
|
|
|
/* best effort anyway, don't worry about synchronization */
|
|
|
|
pcpu_atomic_alloc_failed = false;
|
|
|
|
} else {
|
|
|
|
nr_to_pop = clamp(PCPU_EMPTY_POP_PAGES_HIGH -
|
2021-06-03 09:09:31 +08:00
|
|
|
pcpu_nr_empty_pop_pages,
|
2014-09-03 02:46:05 +08:00
|
|
|
0, PCPU_EMPTY_POP_PAGES_HIGH);
|
|
|
|
}
|
|
|
|
|
2021-04-19 06:44:16 +08:00
|
|
|
for (slot = pcpu_size_to_slot(PAGE_SIZE); slot <= pcpu_free_slot; slot++) {
|
2019-12-14 08:22:10 +08:00
|
|
|
unsigned int nr_unpop = 0, rs, re;
|
2014-09-03 02:46:05 +08:00
|
|
|
|
|
|
|
if (!nr_to_pop)
|
|
|
|
break;
|
|
|
|
|
2021-06-03 09:09:31 +08:00
|
|
|
list_for_each_entry(chunk, &pcpu_chunk_lists[slot], list) {
|
2017-07-25 07:02:07 +08:00
|
|
|
nr_unpop = chunk->nr_pages - chunk->nr_populated;
|
2014-09-03 02:46:05 +08:00
|
|
|
if (nr_unpop)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!nr_unpop)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* @chunk can't go away while pcpu_alloc_mutex is held */
|
2019-12-14 08:22:10 +08:00
|
|
|
bitmap_for_each_clear_region(chunk->populated, rs, re, 0,
|
|
|
|
chunk->nr_pages) {
|
|
|
|
int nr = min_t(int, re - rs, nr_to_pop);
|
2014-09-03 02:46:05 +08:00
|
|
|
|
2021-06-18 03:03:22 +08:00
|
|
|
spin_unlock_irq(&pcpu_lock);
|
2018-02-17 02:07:19 +08:00
|
|
|
ret = pcpu_populate_chunk(chunk, rs, rs + nr, gfp);
|
2021-06-18 03:03:22 +08:00
|
|
|
cond_resched();
|
|
|
|
spin_lock_irq(&pcpu_lock);
|
2014-09-03 02:46:05 +08:00
|
|
|
if (!ret) {
|
|
|
|
nr_to_pop -= nr;
|
2019-02-14 03:10:30 +08:00
|
|
|
pcpu_chunk_populated(chunk, rs, rs + nr);
|
2014-09-03 02:46:05 +08:00
|
|
|
} else {
|
|
|
|
nr_to_pop = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!nr_to_pop)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (nr_to_pop) {
|
|
|
|
/* ran out of chunks to populate, create a new one and retry */
|
2021-06-18 03:03:22 +08:00
|
|
|
spin_unlock_irq(&pcpu_lock);
|
2021-06-03 09:09:31 +08:00
|
|
|
chunk = pcpu_create_chunk(gfp);
|
2021-06-18 03:03:22 +08:00
|
|
|
cond_resched();
|
|
|
|
spin_lock_irq(&pcpu_lock);
|
2014-09-03 02:46:05 +08:00
|
|
|
if (chunk) {
|
|
|
|
pcpu_chunk_relocate(chunk, -1);
|
|
|
|
goto retry_pop;
|
|
|
|
}
|
|
|
|
}
|
2009-02-20 15:29:08 +08:00
|
|
|
}
|
2014-09-03 02:46:05 +08:00
|
|
|
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
/**
|
|
|
|
* pcpu_reclaim_populated - scan over to_depopulate chunks and free empty pages
|
|
|
|
*
|
|
|
|
* Scan over chunks in the depopulate list and try to release unused populated
|
|
|
|
* pages back to the system. Depopulated chunks are sidelined to prevent
|
|
|
|
* repopulating these pages unless required. Fully free chunks are reintegrated
|
|
|
|
* and freed accordingly (1 is kept around). If we drop below the empty
|
|
|
|
* populated pages threshold, reintegrate the chunk if it has empty free pages.
|
|
|
|
* Each chunk is scanned in the reverse order to keep populated pages close to
|
|
|
|
* the beginning of the chunk.
|
2021-06-18 03:03:22 +08:00
|
|
|
*
|
|
|
|
* CONTEXT:
|
|
|
|
* pcpu_lock (can be dropped temporarily)
|
|
|
|
*
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
*/
|
2021-06-03 09:09:31 +08:00
|
|
|
static void pcpu_reclaim_populated(void)
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
{
|
|
|
|
struct pcpu_chunk *chunk;
|
|
|
|
struct pcpu_block_md *block;
|
2021-07-03 11:49:57 +08:00
|
|
|
int freed_page_start, freed_page_end;
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
int i, end;
|
2021-07-03 11:49:57 +08:00
|
|
|
bool reintegrate;
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
|
2021-06-18 03:03:22 +08:00
|
|
|
lockdep_assert_held(&pcpu_lock);
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Once a chunk is isolated to the to_depopulate list, the chunk is no
|
|
|
|
* longer discoverable to allocations whom may populate pages. The only
|
|
|
|
* other accessor is the free path which only returns area back to the
|
|
|
|
* allocator not touching the populated bitmap.
|
|
|
|
*/
|
2021-06-03 09:09:31 +08:00
|
|
|
while (!list_empty(&pcpu_chunk_lists[pcpu_to_depopulate_slot])) {
|
|
|
|
chunk = list_first_entry(&pcpu_chunk_lists[pcpu_to_depopulate_slot],
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
struct pcpu_chunk, list);
|
|
|
|
WARN_ON(chunk->immutable);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Scan chunk's pages in the reverse order to keep populated
|
|
|
|
* pages close to the beginning of the chunk.
|
|
|
|
*/
|
2021-07-03 11:49:57 +08:00
|
|
|
freed_page_start = chunk->nr_pages;
|
|
|
|
freed_page_end = 0;
|
|
|
|
reintegrate = false;
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
for (i = chunk->nr_pages - 1, end = -1; i >= 0; i--) {
|
|
|
|
/* no more work to do */
|
|
|
|
if (chunk->nr_empty_pop_pages == 0)
|
|
|
|
break;
|
|
|
|
|
|
|
|
/* reintegrate chunk to prevent atomic alloc failures */
|
2021-06-03 09:09:31 +08:00
|
|
|
if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_HIGH) {
|
2021-07-03 11:49:57 +08:00
|
|
|
reintegrate = true;
|
|
|
|
goto end_chunk;
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the page is empty and populated, start or
|
|
|
|
* extend the (i, end) range. If i == 0, decrease
|
|
|
|
* i and perform the depopulation to cover the last
|
|
|
|
* (first) page in the chunk.
|
|
|
|
*/
|
|
|
|
block = chunk->md_blocks + i;
|
|
|
|
if (block->contig_hint == PCPU_BITMAP_BLOCK_BITS &&
|
|
|
|
test_bit(i, chunk->populated)) {
|
|
|
|
if (end == -1)
|
|
|
|
end = i;
|
|
|
|
if (i > 0)
|
|
|
|
continue;
|
|
|
|
i--;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* depopulate if there is an active range */
|
|
|
|
if (end == -1)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
spin_unlock_irq(&pcpu_lock);
|
|
|
|
pcpu_depopulate_chunk(chunk, i + 1, end + 1);
|
|
|
|
cond_resched();
|
|
|
|
spin_lock_irq(&pcpu_lock);
|
|
|
|
|
|
|
|
pcpu_chunk_depopulated(chunk, i + 1, end + 1);
|
2021-07-03 11:49:57 +08:00
|
|
|
freed_page_start = min(freed_page_start, i + 1);
|
|
|
|
freed_page_end = max(freed_page_end, end + 1);
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
|
|
|
|
/* reset the range and continue */
|
|
|
|
end = -1;
|
|
|
|
}
|
|
|
|
|
2021-07-03 11:49:57 +08:00
|
|
|
end_chunk:
|
|
|
|
/* batch tlb flush per chunk to amortize cost */
|
|
|
|
if (freed_page_start < freed_page_end) {
|
|
|
|
spin_unlock_irq(&pcpu_lock);
|
|
|
|
pcpu_post_unmap_tlb_flush(chunk,
|
|
|
|
freed_page_start,
|
|
|
|
freed_page_end);
|
|
|
|
cond_resched();
|
|
|
|
spin_lock_irq(&pcpu_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (reintegrate || chunk->free_bytes == pcpu_unit_size)
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
pcpu_reintegrate_chunk(chunk);
|
|
|
|
else
|
2021-07-03 11:49:57 +08:00
|
|
|
list_move_tail(&chunk->list,
|
|
|
|
&pcpu_chunk_lists[pcpu_sidelined_slot]);
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
}
|
2009-02-20 15:29:08 +08:00
|
|
|
}
|
|
|
|
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
/**
|
|
|
|
* pcpu_balance_workfn - manage the amount of free chunks and populated pages
|
|
|
|
* @work: unused
|
|
|
|
*
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
* For each chunk type, manage the number of fully free chunks and the number of
|
|
|
|
* populated pages. An important thing to consider is when pages are freed and
|
|
|
|
* how they contribute to the global counts.
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
*/
|
|
|
|
static void pcpu_balance_workfn(struct work_struct *work)
|
|
|
|
{
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
/*
|
|
|
|
* pcpu_balance_free() is called twice because the first time we may
|
|
|
|
* trim pages in the active pcpu_nr_empty_pop_pages which may cause us
|
|
|
|
* to grow other chunks. This then gives pcpu_reclaim_populated() time
|
|
|
|
* to move fully free chunks to the active list to be freed if
|
|
|
|
* appropriate.
|
|
|
|
*/
|
2021-06-03 09:09:31 +08:00
|
|
|
mutex_lock(&pcpu_alloc_mutex);
|
2021-06-18 03:03:22 +08:00
|
|
|
spin_lock_irq(&pcpu_lock);
|
|
|
|
|
2021-06-03 09:09:31 +08:00
|
|
|
pcpu_balance_free(false);
|
|
|
|
pcpu_reclaim_populated();
|
|
|
|
pcpu_balance_populated();
|
|
|
|
pcpu_balance_free(true);
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
|
2021-06-18 03:03:22 +08:00
|
|
|
spin_unlock_irq(&pcpu_lock);
|
2021-06-03 09:09:31 +08:00
|
|
|
mutex_unlock(&pcpu_alloc_mutex);
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
}
|
|
|
|
|
2009-02-20 15:29:08 +08:00
|
|
|
/**
|
|
|
|
* free_percpu - free percpu area
|
|
|
|
* @ptr: pointer to area to free
|
|
|
|
*
|
2009-03-06 23:44:13 +08:00
|
|
|
* Free percpu area @ptr.
|
|
|
|
*
|
|
|
|
* CONTEXT:
|
|
|
|
* Can be called from atomic context.
|
2009-02-20 15:29:08 +08:00
|
|
|
*/
|
2010-02-02 13:38:57 +08:00
|
|
|
void free_percpu(void __percpu *ptr)
|
2009-02-20 15:29:08 +08:00
|
|
|
{
|
2010-01-09 06:42:39 +08:00
|
|
|
void *addr;
|
2009-02-20 15:29:08 +08:00
|
|
|
struct pcpu_chunk *chunk;
|
2009-03-06 23:44:13 +08:00
|
|
|
unsigned long flags;
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
int size, off;
|
2019-05-08 09:43:20 +08:00
|
|
|
bool need_balance = false;
|
2009-02-20 15:29:08 +08:00
|
|
|
|
|
|
|
if (!ptr)
|
|
|
|
return;
|
|
|
|
|
2011-09-27 00:12:53 +08:00
|
|
|
kmemleak_free_percpu(ptr);
|
|
|
|
|
2010-01-09 06:42:39 +08:00
|
|
|
addr = __pcpu_ptr_to_addr(ptr);
|
|
|
|
|
2009-03-06 23:44:13 +08:00
|
|
|
spin_lock_irqsave(&pcpu_lock, flags);
|
2009-02-20 15:29:08 +08:00
|
|
|
|
|
|
|
chunk = pcpu_chunk_addr_search(addr);
|
2009-08-14 14:00:51 +08:00
|
|
|
off = addr - chunk->base_addr;
|
2009-02-20 15:29:08 +08:00
|
|
|
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
size = pcpu_free_area(chunk, off);
|
|
|
|
|
|
|
|
pcpu_memcg_free_hook(chunk, off, size);
|
2009-02-20 15:29:08 +08:00
|
|
|
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
/*
|
|
|
|
* If there are more than one fully free chunks, wake up grim reaper.
|
|
|
|
* If the chunk is isolated, it may be in the process of being
|
|
|
|
* reclaimed. Let reclaim manage cleaning up of that chunk.
|
|
|
|
*/
|
|
|
|
if (!chunk->isolated && chunk->free_bytes == pcpu_unit_size) {
|
2009-02-20 15:29:08 +08:00
|
|
|
struct pcpu_chunk *pos;
|
|
|
|
|
2021-06-03 09:09:31 +08:00
|
|
|
list_for_each_entry(pos, &pcpu_chunk_lists[pcpu_free_slot], list)
|
2009-02-20 15:29:08 +08:00
|
|
|
if (pos != chunk) {
|
2019-05-08 09:43:20 +08:00
|
|
|
need_balance = true;
|
2009-02-20 15:29:08 +08:00
|
|
|
break;
|
|
|
|
}
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
} else if (pcpu_should_reclaim_chunk(chunk)) {
|
|
|
|
pcpu_isolate_chunk(chunk);
|
|
|
|
need_balance = true;
|
2009-02-20 15:29:08 +08:00
|
|
|
}
|
|
|
|
|
2017-06-20 07:28:32 +08:00
|
|
|
trace_percpu_free_percpu(chunk->base_addr, off, ptr);
|
|
|
|
|
2009-03-06 23:44:13 +08:00
|
|
|
spin_unlock_irqrestore(&pcpu_lock, flags);
|
2019-05-08 09:43:20 +08:00
|
|
|
|
|
|
|
if (need_balance)
|
|
|
|
pcpu_schedule_balance_work();
|
2009-02-20 15:29:08 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(free_percpu);
|
|
|
|
|
2017-02-27 22:37:36 +08:00
|
|
|
bool __is_kernel_percpu_address(unsigned long addr, unsigned long *can_addr)
|
2010-03-10 17:57:54 +08:00
|
|
|
{
|
2010-09-04 00:22:48 +08:00
|
|
|
#ifdef CONFIG_SMP
|
2010-03-10 17:57:54 +08:00
|
|
|
const size_t static_size = __per_cpu_end - __per_cpu_start;
|
|
|
|
void __percpu *base = __addr_to_pcpu_ptr(pcpu_base_addr);
|
|
|
|
unsigned int cpu;
|
|
|
|
|
|
|
|
for_each_possible_cpu(cpu) {
|
|
|
|
void *start = per_cpu_ptr(base, cpu);
|
2017-02-27 22:37:36 +08:00
|
|
|
void *va = (void *)addr;
|
2010-03-10 17:57:54 +08:00
|
|
|
|
2017-02-27 22:37:36 +08:00
|
|
|
if (va >= start && va < start + static_size) {
|
2017-03-20 19:26:55 +08:00
|
|
|
if (can_addr) {
|
2017-02-27 22:37:36 +08:00
|
|
|
*can_addr = (unsigned long) (va - start);
|
2017-03-20 19:26:55 +08:00
|
|
|
*can_addr += (unsigned long)
|
|
|
|
per_cpu_ptr(base, get_boot_cpu_id());
|
|
|
|
}
|
2010-03-10 17:57:54 +08:00
|
|
|
return true;
|
2017-02-27 22:37:36 +08:00
|
|
|
}
|
|
|
|
}
|
2010-09-04 00:22:48 +08:00
|
|
|
#endif
|
|
|
|
/* on UP, can't distinguish from other static vars, always false */
|
2010-03-10 17:57:54 +08:00
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2017-02-27 22:37:36 +08:00
|
|
|
/**
|
|
|
|
* is_kernel_percpu_address - test whether address is from static percpu area
|
|
|
|
* @addr: address to test
|
|
|
|
*
|
|
|
|
* Test whether @addr belongs to in-kernel static percpu area. Module
|
|
|
|
* static percpu areas are not considered. For those, use
|
|
|
|
* is_module_percpu_address().
|
|
|
|
*
|
|
|
|
* RETURNS:
|
|
|
|
* %true if @addr is from in-kernel static percpu area, %false otherwise.
|
|
|
|
*/
|
|
|
|
bool is_kernel_percpu_address(unsigned long addr)
|
|
|
|
{
|
|
|
|
return __is_kernel_percpu_address(addr, NULL);
|
|
|
|
}
|
|
|
|
|
2009-11-24 14:50:03 +08:00
|
|
|
/**
|
|
|
|
* per_cpu_ptr_to_phys - convert translated percpu address to physical address
|
|
|
|
* @addr: the address to be converted to physical address
|
|
|
|
*
|
|
|
|
* Given @addr which is dereferenceable address obtained via one of
|
|
|
|
* percpu access macros, this function translates it into its physical
|
|
|
|
* address. The caller is responsible for ensuring @addr stays valid
|
|
|
|
* until this function finishes.
|
|
|
|
*
|
2011-11-24 00:20:53 +08:00
|
|
|
* percpu allocator has special setup for the first chunk, which currently
|
|
|
|
* supports either embedding in linear address space or vmalloc mapping,
|
|
|
|
* and, from the second one, the backing allocator (currently either vm or
|
|
|
|
* km) provides translation.
|
|
|
|
*
|
2015-03-07 06:30:42 +08:00
|
|
|
* The addr can be translated simply without checking if it falls into the
|
2011-11-24 00:20:53 +08:00
|
|
|
* first chunk. But the current code reflects better how percpu allocator
|
|
|
|
* actually works, and the verification can discover both bugs in percpu
|
|
|
|
* allocator itself and per_cpu_ptr_to_phys() callers. So we keep current
|
|
|
|
* code.
|
|
|
|
*
|
2009-11-24 14:50:03 +08:00
|
|
|
* RETURNS:
|
|
|
|
* The physical address for @addr.
|
|
|
|
*/
|
|
|
|
phys_addr_t per_cpu_ptr_to_phys(void *addr)
|
|
|
|
{
|
2010-06-18 17:44:31 +08:00
|
|
|
void __percpu *base = __addr_to_pcpu_ptr(pcpu_base_addr);
|
|
|
|
bool in_first_chunk = false;
|
2011-11-19 02:55:35 +08:00
|
|
|
unsigned long first_low, first_high;
|
2010-06-18 17:44:31 +08:00
|
|
|
unsigned int cpu;
|
|
|
|
|
|
|
|
/*
|
2011-11-19 02:55:35 +08:00
|
|
|
* The following test on unit_low/high isn't strictly
|
2010-06-18 17:44:31 +08:00
|
|
|
* necessary but will speed up lookups of addresses which
|
|
|
|
* aren't in the first chunk.
|
2017-07-25 07:02:05 +08:00
|
|
|
*
|
|
|
|
* The address check is against full chunk sizes. pcpu_base_addr
|
|
|
|
* points to the beginning of the first chunk including the
|
|
|
|
* static region. Assumes good intent as the first chunk may
|
|
|
|
* not be full (ie. < pcpu_unit_pages in size).
|
2010-06-18 17:44:31 +08:00
|
|
|
*/
|
2017-07-25 07:02:05 +08:00
|
|
|
first_low = (unsigned long)pcpu_base_addr +
|
|
|
|
pcpu_unit_page_offset(pcpu_low_unit_cpu, 0);
|
|
|
|
first_high = (unsigned long)pcpu_base_addr +
|
|
|
|
pcpu_unit_page_offset(pcpu_high_unit_cpu, pcpu_unit_pages);
|
2011-11-19 02:55:35 +08:00
|
|
|
if ((unsigned long)addr >= first_low &&
|
|
|
|
(unsigned long)addr < first_high) {
|
2010-06-18 17:44:31 +08:00
|
|
|
for_each_possible_cpu(cpu) {
|
|
|
|
void *start = per_cpu_ptr(base, cpu);
|
|
|
|
|
|
|
|
if (addr >= start && addr < start + pcpu_unit_size) {
|
|
|
|
in_first_chunk = true;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (in_first_chunk) {
|
2011-03-28 19:53:29 +08:00
|
|
|
if (!is_vmalloc_addr(addr))
|
2010-04-09 17:57:00 +08:00
|
|
|
return __pa(addr);
|
|
|
|
else
|
2011-12-16 03:25:59 +08:00
|
|
|
return page_to_phys(vmalloc_to_page(addr)) +
|
|
|
|
offset_in_page(addr);
|
2010-04-09 17:57:00 +08:00
|
|
|
} else
|
2011-12-16 03:25:59 +08:00
|
|
|
return page_to_phys(pcpu_addr_to_page(addr)) +
|
|
|
|
offset_in_page(addr);
|
2009-11-24 14:50:03 +08:00
|
|
|
}
|
|
|
|
|
2009-02-20 15:29:08 +08:00
|
|
|
/**
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
* pcpu_alloc_alloc_info - allocate percpu allocation info
|
|
|
|
* @nr_groups: the number of groups
|
|
|
|
* @nr_units: the number of units
|
|
|
|
*
|
|
|
|
* Allocate ai which is large enough for @nr_groups groups containing
|
|
|
|
* @nr_units units. The returned ai's groups[0].cpu_map points to the
|
|
|
|
* cpu_map array which is long enough for @nr_units and filled with
|
|
|
|
* NR_CPUS. It's the caller's responsibility to initialize cpu_map
|
|
|
|
* pointer of other groups.
|
|
|
|
*
|
|
|
|
* RETURNS:
|
|
|
|
* Pointer to the allocated pcpu_alloc_info on success, NULL on
|
|
|
|
* failure.
|
|
|
|
*/
|
|
|
|
struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups,
|
|
|
|
int nr_units)
|
|
|
|
{
|
|
|
|
struct pcpu_alloc_info *ai;
|
|
|
|
size_t base_size, ai_size;
|
|
|
|
void *ptr;
|
|
|
|
int unit;
|
|
|
|
|
2019-08-30 03:06:05 +08:00
|
|
|
base_size = ALIGN(struct_size(ai, groups, nr_groups),
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
__alignof__(ai->groups[0].cpu_map[0]));
|
|
|
|
ai_size = base_size + nr_units * sizeof(ai->groups[0].cpu_map[0]);
|
|
|
|
|
2019-03-12 14:30:42 +08:00
|
|
|
ptr = memblock_alloc(PFN_ALIGN(ai_size), PAGE_SIZE);
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
if (!ptr)
|
|
|
|
return NULL;
|
|
|
|
ai = ptr;
|
|
|
|
ptr += base_size;
|
|
|
|
|
|
|
|
ai->groups[0].cpu_map = ptr;
|
|
|
|
|
|
|
|
for (unit = 0; unit < nr_units; unit++)
|
|
|
|
ai->groups[0].cpu_map[unit] = NR_CPUS;
|
|
|
|
|
|
|
|
ai->nr_groups = nr_groups;
|
|
|
|
ai->__ai_size = PFN_ALIGN(ai_size);
|
|
|
|
|
|
|
|
return ai;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* pcpu_free_alloc_info - free percpu allocation info
|
|
|
|
* @ai: pcpu_alloc_info to free
|
|
|
|
*
|
|
|
|
* Free @ai which was allocated by pcpu_alloc_alloc_info().
|
|
|
|
*/
|
|
|
|
void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai)
|
|
|
|
{
|
2021-11-06 04:43:22 +08:00
|
|
|
memblock_free(ai, ai->__ai_size);
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* pcpu_dump_alloc_info - print out information about pcpu_alloc_info
|
|
|
|
* @lvl: loglevel
|
|
|
|
* @ai: allocation info to dump
|
|
|
|
*
|
|
|
|
* Print out information about @ai using loglevel @lvl.
|
|
|
|
*/
|
|
|
|
static void pcpu_dump_alloc_info(const char *lvl,
|
|
|
|
const struct pcpu_alloc_info *ai)
|
2009-08-14 14:00:51 +08:00
|
|
|
{
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
int group_width = 1, cpu_width = 1, width;
|
2009-08-14 14:00:51 +08:00
|
|
|
char empty_str[] = "--------";
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
int alloc = 0, alloc_end = 0;
|
|
|
|
int group, v;
|
|
|
|
int upa, apl; /* units per alloc, allocs per line */
|
|
|
|
|
|
|
|
v = ai->nr_groups;
|
|
|
|
while (v /= 10)
|
|
|
|
group_width++;
|
2009-08-14 14:00:51 +08:00
|
|
|
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
v = num_possible_cpus();
|
2009-08-14 14:00:51 +08:00
|
|
|
while (v /= 10)
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
cpu_width++;
|
|
|
|
empty_str[min_t(int, cpu_width, sizeof(empty_str) - 1)] = '\0';
|
2009-08-14 14:00:51 +08:00
|
|
|
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
upa = ai->alloc_size / ai->unit_size;
|
|
|
|
width = upa * (cpu_width + 1) + group_width + 3;
|
|
|
|
apl = rounddown_pow_of_two(max(60 / width, 1));
|
2009-08-14 14:00:51 +08:00
|
|
|
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
printk("%spcpu-alloc: s%zu r%zu d%zu u%zu alloc=%zu*%zu",
|
|
|
|
lvl, ai->static_size, ai->reserved_size, ai->dyn_size,
|
|
|
|
ai->unit_size, ai->alloc_size / ai->atom_size, ai->atom_size);
|
2009-08-14 14:00:51 +08:00
|
|
|
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
for (group = 0; group < ai->nr_groups; group++) {
|
|
|
|
const struct pcpu_group_info *gi = &ai->groups[group];
|
|
|
|
int unit = 0, unit_end = 0;
|
|
|
|
|
|
|
|
BUG_ON(gi->nr_units % upa);
|
|
|
|
for (alloc_end += gi->nr_units / upa;
|
|
|
|
alloc < alloc_end; alloc++) {
|
|
|
|
if (!(alloc % apl)) {
|
2016-03-18 05:19:50 +08:00
|
|
|
pr_cont("\n");
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
printk("%spcpu-alloc: ", lvl);
|
|
|
|
}
|
2016-03-18 05:19:50 +08:00
|
|
|
pr_cont("[%0*d] ", group_width, group);
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
|
|
|
|
for (unit_end += upa; unit < unit_end; unit++)
|
|
|
|
if (gi->cpu_map[unit] != NR_CPUS)
|
2016-03-18 05:19:50 +08:00
|
|
|
pr_cont("%0*d ",
|
|
|
|
cpu_width, gi->cpu_map[unit]);
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
else
|
2016-03-18 05:19:50 +08:00
|
|
|
pr_cont("%s ", empty_str);
|
2009-08-14 14:00:51 +08:00
|
|
|
}
|
|
|
|
}
|
2016-03-18 05:19:50 +08:00
|
|
|
pr_cont("\n");
|
2009-08-14 14:00:51 +08:00
|
|
|
}
|
|
|
|
|
2009-02-20 15:29:08 +08:00
|
|
|
/**
|
2009-02-24 10:57:21 +08:00
|
|
|
* pcpu_setup_first_chunk - initialize the first percpu chunk
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
* @ai: pcpu_alloc_info describing how to percpu area is shaped
|
2009-07-04 07:10:59 +08:00
|
|
|
* @base_addr: mapped address
|
2009-02-24 10:57:21 +08:00
|
|
|
*
|
|
|
|
* Initialize the first percpu chunk which contains the kernel static
|
2019-07-21 17:56:33 +08:00
|
|
|
* percpu area. This function is to be called from arch percpu area
|
2009-07-04 07:10:59 +08:00
|
|
|
* setup path.
|
2009-02-24 10:57:21 +08:00
|
|
|
*
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
* @ai contains all information necessary to initialize the first
|
|
|
|
* chunk and prime the dynamic percpu allocator.
|
|
|
|
*
|
|
|
|
* @ai->static_size is the size of static percpu area.
|
|
|
|
*
|
|
|
|
* @ai->reserved_size, if non-zero, specifies the amount of bytes to
|
2009-03-06 13:33:59 +08:00
|
|
|
* reserve after the static area in the first chunk. This reserves
|
|
|
|
* the first chunk such that it's available only through reserved
|
|
|
|
* percpu allocation. This is primarily used to serve module percpu
|
|
|
|
* static areas on architectures where the addressing model has
|
|
|
|
* limited offset range for symbol relocations to guarantee module
|
|
|
|
* percpu symbols fall inside the relocatable range.
|
|
|
|
*
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
* @ai->dyn_size determines the number of bytes available for dynamic
|
|
|
|
* allocation in the first chunk. The area between @ai->static_size +
|
|
|
|
* @ai->reserved_size + @ai->dyn_size and @ai->unit_size is unused.
|
2009-03-10 15:27:48 +08:00
|
|
|
*
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
* @ai->unit_size specifies unit size and must be aligned to PAGE_SIZE
|
|
|
|
* and equal to or larger than @ai->static_size + @ai->reserved_size +
|
|
|
|
* @ai->dyn_size.
|
2009-02-24 10:57:21 +08:00
|
|
|
*
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
* @ai->atom_size is the allocation atom size and used as alignment
|
|
|
|
* for vm areas.
|
2009-02-24 10:57:21 +08:00
|
|
|
*
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
* @ai->alloc_size is the allocation size and always multiple of
|
|
|
|
* @ai->atom_size. This is larger than @ai->atom_size if
|
|
|
|
* @ai->unit_size is larger than @ai->atom_size.
|
|
|
|
*
|
|
|
|
* @ai->nr_groups and @ai->groups describe virtual memory layout of
|
|
|
|
* percpu areas. Units which should be colocated are put into the
|
|
|
|
* same group. Dynamic VM areas will be allocated according to these
|
|
|
|
* groupings. If @ai->nr_groups is zero, a single group containing
|
|
|
|
* all units is assumed.
|
2009-02-24 10:57:21 +08:00
|
|
|
*
|
2009-07-04 07:10:59 +08:00
|
|
|
* The caller should have mapped the first chunk at @base_addr and
|
|
|
|
* copied static data to each unit.
|
2009-02-20 15:29:08 +08:00
|
|
|
*
|
2017-07-25 07:02:05 +08:00
|
|
|
* The first chunk will always contain a static and a dynamic region.
|
|
|
|
* However, the static region is not managed by any chunk. If the first
|
|
|
|
* chunk also contains a reserved region, it is served by two chunks -
|
|
|
|
* one for the reserved region and one for the dynamic region. They
|
|
|
|
* share the same vm, but use offset regions in the area allocation map.
|
|
|
|
* The chunk serving the dynamic region is circulated in the chunk slots
|
|
|
|
* and available for dynamic allocation like any other chunk.
|
2009-02-20 15:29:08 +08:00
|
|
|
*/
|
2019-07-03 16:25:52 +08:00
|
|
|
void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
|
|
|
|
void *base_addr)
|
2009-02-20 15:29:08 +08:00
|
|
|
{
|
2017-07-25 07:02:01 +08:00
|
|
|
size_t size_sum = ai->static_size + ai->reserved_size + ai->dyn_size;
|
2017-07-25 07:02:09 +08:00
|
|
|
size_t static_size, dyn_size;
|
2017-07-25 07:02:04 +08:00
|
|
|
struct pcpu_chunk *chunk;
|
2009-08-14 14:00:52 +08:00
|
|
|
unsigned long *group_offsets;
|
|
|
|
size_t *group_sizes;
|
2009-08-14 14:00:51 +08:00
|
|
|
unsigned long *unit_off;
|
2009-02-20 15:29:08 +08:00
|
|
|
unsigned int cpu;
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
int *unit_map;
|
|
|
|
int group, unit, i;
|
2017-07-25 07:02:05 +08:00
|
|
|
int map_size;
|
|
|
|
unsigned long tmp_addr;
|
2019-03-12 14:30:15 +08:00
|
|
|
size_t alloc_size;
|
2009-02-20 15:29:08 +08:00
|
|
|
|
2009-09-24 08:43:11 +08:00
|
|
|
#define PCPU_SETUP_BUG_ON(cond) do { \
|
|
|
|
if (unlikely(cond)) { \
|
2016-03-18 05:19:53 +08:00
|
|
|
pr_emerg("failed to initialize, %s\n", #cond); \
|
|
|
|
pr_emerg("cpu_possible_mask=%*pb\n", \
|
2015-02-14 06:37:34 +08:00
|
|
|
cpumask_pr_args(cpu_possible_mask)); \
|
2009-09-24 08:43:11 +08:00
|
|
|
pcpu_dump_alloc_info(KERN_EMERG, ai); \
|
|
|
|
BUG(); \
|
|
|
|
} \
|
|
|
|
} while (0)
|
|
|
|
|
2009-07-04 07:11:00 +08:00
|
|
|
/* sanity checks */
|
2009-09-24 08:43:11 +08:00
|
|
|
PCPU_SETUP_BUG_ON(ai->nr_groups <= 0);
|
2010-09-04 00:22:48 +08:00
|
|
|
#ifdef CONFIG_SMP
|
2009-09-24 08:43:11 +08:00
|
|
|
PCPU_SETUP_BUG_ON(!ai->static_size);
|
2015-11-06 10:46:43 +08:00
|
|
|
PCPU_SETUP_BUG_ON(offset_in_page(__per_cpu_start));
|
2010-09-04 00:22:48 +08:00
|
|
|
#endif
|
2009-09-24 08:43:11 +08:00
|
|
|
PCPU_SETUP_BUG_ON(!base_addr);
|
2015-11-06 10:46:43 +08:00
|
|
|
PCPU_SETUP_BUG_ON(offset_in_page(base_addr));
|
2009-09-24 08:43:11 +08:00
|
|
|
PCPU_SETUP_BUG_ON(ai->unit_size < size_sum);
|
2015-11-06 10:46:43 +08:00
|
|
|
PCPU_SETUP_BUG_ON(offset_in_page(ai->unit_size));
|
2009-09-24 08:43:11 +08:00
|
|
|
PCPU_SETUP_BUG_ON(ai->unit_size < PCPU_MIN_UNIT_SIZE);
|
2017-07-25 07:02:12 +08:00
|
|
|
PCPU_SETUP_BUG_ON(!IS_ALIGNED(ai->unit_size, PCPU_BITMAP_BLOCK_SIZE));
|
2010-06-28 00:50:00 +08:00
|
|
|
PCPU_SETUP_BUG_ON(ai->dyn_size < PERCPU_DYNAMIC_EARLY_SIZE);
|
2017-07-25 07:01:58 +08:00
|
|
|
PCPU_SETUP_BUG_ON(!ai->dyn_size);
|
2017-07-25 07:02:09 +08:00
|
|
|
PCPU_SETUP_BUG_ON(!IS_ALIGNED(ai->reserved_size, PCPU_MIN_ALLOC_SIZE));
|
2017-07-25 07:02:12 +08:00
|
|
|
PCPU_SETUP_BUG_ON(!(IS_ALIGNED(PCPU_BITMAP_BLOCK_SIZE, PAGE_SIZE) ||
|
|
|
|
IS_ALIGNED(PAGE_SIZE, PCPU_BITMAP_BLOCK_SIZE)));
|
2010-04-09 17:57:01 +08:00
|
|
|
PCPU_SETUP_BUG_ON(pcpu_verify_alloc_info(ai) < 0);
|
2009-02-24 10:57:21 +08:00
|
|
|
|
2009-08-14 14:00:52 +08:00
|
|
|
/* process group information and build config tables accordingly */
|
2019-03-12 14:30:15 +08:00
|
|
|
alloc_size = ai->nr_groups * sizeof(group_offsets[0]);
|
|
|
|
group_offsets = memblock_alloc(alloc_size, SMP_CACHE_BYTES);
|
|
|
|
if (!group_offsets)
|
|
|
|
panic("%s: Failed to allocate %zu bytes\n", __func__,
|
|
|
|
alloc_size);
|
|
|
|
|
|
|
|
alloc_size = ai->nr_groups * sizeof(group_sizes[0]);
|
|
|
|
group_sizes = memblock_alloc(alloc_size, SMP_CACHE_BYTES);
|
|
|
|
if (!group_sizes)
|
|
|
|
panic("%s: Failed to allocate %zu bytes\n", __func__,
|
|
|
|
alloc_size);
|
|
|
|
|
|
|
|
alloc_size = nr_cpu_ids * sizeof(unit_map[0]);
|
|
|
|
unit_map = memblock_alloc(alloc_size, SMP_CACHE_BYTES);
|
|
|
|
if (!unit_map)
|
|
|
|
panic("%s: Failed to allocate %zu bytes\n", __func__,
|
|
|
|
alloc_size);
|
|
|
|
|
|
|
|
alloc_size = nr_cpu_ids * sizeof(unit_off[0]);
|
|
|
|
unit_off = memblock_alloc(alloc_size, SMP_CACHE_BYTES);
|
|
|
|
if (!unit_off)
|
|
|
|
panic("%s: Failed to allocate %zu bytes\n", __func__,
|
|
|
|
alloc_size);
|
2009-07-04 07:11:00 +08:00
|
|
|
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
for (cpu = 0; cpu < nr_cpu_ids; cpu++)
|
2009-09-29 08:17:56 +08:00
|
|
|
unit_map[cpu] = UINT_MAX;
|
2011-11-19 02:55:35 +08:00
|
|
|
|
|
|
|
pcpu_low_unit_cpu = NR_CPUS;
|
|
|
|
pcpu_high_unit_cpu = NR_CPUS;
|
2009-07-04 07:11:00 +08:00
|
|
|
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
for (group = 0, unit = 0; group < ai->nr_groups; group++, unit += i) {
|
|
|
|
const struct pcpu_group_info *gi = &ai->groups[group];
|
2009-07-04 07:11:00 +08:00
|
|
|
|
2009-08-14 14:00:52 +08:00
|
|
|
group_offsets[group] = gi->base_offset;
|
|
|
|
group_sizes[group] = gi->nr_units * ai->unit_size;
|
|
|
|
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
for (i = 0; i < gi->nr_units; i++) {
|
|
|
|
cpu = gi->cpu_map[i];
|
|
|
|
if (cpu == NR_CPUS)
|
|
|
|
continue;
|
2009-02-24 10:57:21 +08:00
|
|
|
|
2014-10-29 16:45:04 +08:00
|
|
|
PCPU_SETUP_BUG_ON(cpu >= nr_cpu_ids);
|
2009-09-24 08:43:11 +08:00
|
|
|
PCPU_SETUP_BUG_ON(!cpu_possible(cpu));
|
|
|
|
PCPU_SETUP_BUG_ON(unit_map[cpu] != UINT_MAX);
|
2009-02-20 15:29:08 +08:00
|
|
|
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
unit_map[cpu] = unit + i;
|
2009-08-14 14:00:51 +08:00
|
|
|
unit_off[cpu] = gi->base_offset + i * ai->unit_size;
|
|
|
|
|
2011-11-19 02:55:35 +08:00
|
|
|
/* determine low/high unit_cpu */
|
|
|
|
if (pcpu_low_unit_cpu == NR_CPUS ||
|
|
|
|
unit_off[cpu] < unit_off[pcpu_low_unit_cpu])
|
|
|
|
pcpu_low_unit_cpu = cpu;
|
|
|
|
if (pcpu_high_unit_cpu == NR_CPUS ||
|
|
|
|
unit_off[cpu] > unit_off[pcpu_high_unit_cpu])
|
|
|
|
pcpu_high_unit_cpu = cpu;
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
}
|
2009-07-04 07:11:00 +08:00
|
|
|
}
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
pcpu_nr_units = unit;
|
|
|
|
|
|
|
|
for_each_possible_cpu(cpu)
|
2009-09-24 08:43:11 +08:00
|
|
|
PCPU_SETUP_BUG_ON(unit_map[cpu] == UINT_MAX);
|
|
|
|
|
|
|
|
/* we're done parsing the input, undefine BUG macro and dump config */
|
|
|
|
#undef PCPU_SETUP_BUG_ON
|
2010-12-22 21:19:14 +08:00
|
|
|
pcpu_dump_alloc_info(KERN_DEBUG, ai);
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
|
2009-08-14 14:00:52 +08:00
|
|
|
pcpu_nr_groups = ai->nr_groups;
|
|
|
|
pcpu_group_offsets = group_offsets;
|
|
|
|
pcpu_group_sizes = group_sizes;
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
pcpu_unit_map = unit_map;
|
2009-08-14 14:00:51 +08:00
|
|
|
pcpu_unit_offsets = unit_off;
|
2009-07-04 07:11:00 +08:00
|
|
|
|
|
|
|
/* determine basic parameters */
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
pcpu_unit_pages = ai->unit_size >> PAGE_SHIFT;
|
2009-02-24 10:57:21 +08:00
|
|
|
pcpu_unit_size = pcpu_unit_pages << PAGE_SHIFT;
|
2009-08-14 14:00:52 +08:00
|
|
|
pcpu_atom_size = ai->atom_size;
|
2020-10-31 04:40:21 +08:00
|
|
|
pcpu_chunk_struct_size = struct_size(chunk, populated,
|
|
|
|
BITS_TO_LONGS(pcpu_unit_pages));
|
2009-03-06 13:33:59 +08:00
|
|
|
|
2017-06-20 07:28:31 +08:00
|
|
|
pcpu_stats_save_ai(ai);
|
|
|
|
|
2009-02-24 10:57:21 +08:00
|
|
|
/*
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
* Allocate chunk slots. The slots after the active slots are:
|
|
|
|
* sidelined_slot - isolated, depopulated chunks
|
|
|
|
* free_slot - fully free chunks
|
|
|
|
* to_depopulate_slot - isolated, chunks to depopulate
|
2009-02-24 10:57:21 +08:00
|
|
|
*/
|
percpu: implement partial chunk depopulation
From Roman ("percpu: partial chunk depopulation"):
In our [Facebook] production experience the percpu memory allocator is
sometimes struggling with returning the memory to the system. A typical
example is a creation of several thousands memory cgroups (each has
several chunks of the percpu data used for vmstats, vmevents,
ref counters etc). Deletion and complete releasing of these cgroups
doesn't always lead to a shrinkage of the percpu memory, so that
sometimes there are several GB's of memory wasted.
The underlying problem is the fragmentation: to release an underlying
chunk all percpu allocations should be released first. The percpu
allocator tends to top up chunks to improve the utilization. It means
new small-ish allocations (e.g. percpu ref counters) are placed onto
almost filled old-ish chunks, effectively pinning them in memory.
This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.
To illustrate the problem the following script can be used:
--
cd /sys/fs/cgroup
mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done
sleep 10
cat /proc/meminfo | grep Percpu
for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done
rmdir percpu_test
--
It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.
Results:
vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481152 kB
Percpu: 481152 kB
with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu: 481408 kB
Percpu: 135552 kB
The total size of the percpu memory was reduced by more than 3.5 times.
This patch:
This patch implements partial depopulation of percpu chunks.
As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However
to minimize a memory waste it might be useful to depopulate a
partially filed chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.
This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations prefer using
active chunks to sidelined chunks. If a sidelined chunk is used, it is
reintegrated to the active lists.
The depopulation is scheduled on the free path if the chunk is all of
the following:
1) has more than 1/4 of total pages free and populated
2) the system has enough free percpu pages aside of this chunk
3) isn't the reserved chunk
4) isn't the first chunk
If it's already depopulated but got free populated pages, it's a good
target too. The chunk is moved to a special slot,
pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
item is scheduled. On isolation, these pages are removed from the
pcpu_nr_empty_pop_pages. It is constantly replaced to the
to_depopulate_slot when it meets these qualifications.
pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
becomes empty. The depopulation is performed in the reverse direction to
keep populated pages close to the beginning. Depopulated chunks are
sidelined to preferentially avoid them for new allocations. When no
active chunk can suffice a new allocation, sidelined chunks are first
checked before creating a new chunk.
Signed-off-by: Roman Gushchin <guro@fb.com>
Co-developed-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Tested-by: Pratik Sampat <psampat@linux.ibm.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
2021-04-08 11:57:36 +08:00
|
|
|
pcpu_sidelined_slot = __pcpu_size_to_slot(pcpu_unit_size) + 1;
|
|
|
|
pcpu_free_slot = pcpu_sidelined_slot + 1;
|
|
|
|
pcpu_to_depopulate_slot = pcpu_free_slot + 1;
|
|
|
|
pcpu_nr_slots = pcpu_to_depopulate_slot + 1;
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
pcpu_chunk_lists = memblock_alloc(pcpu_nr_slots *
|
2021-06-03 09:09:31 +08:00
|
|
|
sizeof(pcpu_chunk_lists[0]),
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
SMP_CACHE_BYTES);
|
|
|
|
if (!pcpu_chunk_lists)
|
2019-03-12 14:30:15 +08:00
|
|
|
panic("%s: Failed to allocate %zu bytes\n", __func__,
|
2021-06-03 09:09:31 +08:00
|
|
|
pcpu_nr_slots * sizeof(pcpu_chunk_lists[0]));
|
mm: memcg/percpu: account percpu memory to memory cgroups
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and big
number of cgroups they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the perpcu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
[akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[akpm@linux-foundation.org: move unreachable code, per Roman]
[cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tobin C. Harding <tobin@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: Bixuan Cui <cuibixuan@huawei.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 09:30:17 +08:00
|
|
|
|
2021-06-03 09:09:31 +08:00
|
|
|
for (i = 0; i < pcpu_nr_slots; i++)
|
|
|
|
INIT_LIST_HEAD(&pcpu_chunk_lists[i]);
|
2009-02-20 15:29:08 +08:00
|
|
|
|
2017-07-25 07:02:09 +08:00
|
|
|
/*
|
|
|
|
* The end of the static region needs to be aligned with the
|
|
|
|
* minimum allocation size as this offsets the reserved and
|
|
|
|
* dynamic region. The first chunk ends page aligned by
|
|
|
|
* expanding the dynamic region, therefore the dynamic region
|
|
|
|
* can be shrunk to compensate while still staying above the
|
|
|
|
* configured sizes.
|
|
|
|
*/
|
|
|
|
static_size = ALIGN(ai->static_size, PCPU_MIN_ALLOC_SIZE);
|
|
|
|
dyn_size = ai->dyn_size - (static_size - ai->static_size);
|
|
|
|
|
2009-03-06 13:33:59 +08:00
|
|
|
/*
|
2017-07-25 07:02:05 +08:00
|
|
|
* Initialize first chunk.
|
|
|
|
* If the reserved_size is non-zero, this initializes the reserved
|
|
|
|
* chunk. If the reserved_size is zero, the reserved chunk is NULL
|
|
|
|
* and the dynamic region is initialized here. The first chunk,
|
|
|
|
* pcpu_first_chunk, will always point to the chunk that serves
|
|
|
|
* the dynamic region.
|
2009-03-06 13:33:59 +08:00
|
|
|
*/
|
2017-07-25 07:02:09 +08:00
|
|
|
tmp_addr = (unsigned long)base_addr + static_size;
|
|
|
|
map_size = ai->reserved_size ?: dyn_size;
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
chunk = pcpu_alloc_first_chunk(tmp_addr, map_size);
|
2009-03-06 13:33:59 +08:00
|
|
|
|
2009-03-06 13:33:59 +08:00
|
|
|
/* init dynamic chunk if necessary */
|
2017-07-25 07:02:01 +08:00
|
|
|
if (ai->reserved_size) {
|
2017-07-25 07:02:04 +08:00
|
|
|
pcpu_reserved_chunk = chunk;
|
2017-07-25 07:02:01 +08:00
|
|
|
|
2017-07-25 07:02:09 +08:00
|
|
|
tmp_addr = (unsigned long)base_addr + static_size +
|
2017-07-25 07:02:05 +08:00
|
|
|
ai->reserved_size;
|
2017-07-25 07:02:09 +08:00
|
|
|
map_size = dyn_size;
|
percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-13 02:27:32 +08:00
|
|
|
chunk = pcpu_alloc_first_chunk(tmp_addr, map_size);
|
2009-03-06 13:33:59 +08:00
|
|
|
}
|
|
|
|
|
2009-03-06 13:33:59 +08:00
|
|
|
/* link the first chunk in */
|
2017-07-25 07:02:04 +08:00
|
|
|
pcpu_first_chunk = chunk;
|
2021-06-03 09:09:31 +08:00
|
|
|
pcpu_nr_empty_pop_pages = pcpu_first_chunk->nr_empty_pop_pages;
|
2009-04-02 12:19:54 +08:00
|
|
|
pcpu_chunk_relocate(pcpu_first_chunk, -1);
|
2009-02-20 15:29:08 +08:00
|
|
|
|
2018-08-22 12:53:58 +08:00
|
|
|
/* include all regions of the first chunk */
|
|
|
|
pcpu_nr_populated += PFN_DOWN(size_sum);
|
|
|
|
|
2017-06-20 07:28:31 +08:00
|
|
|
pcpu_stats_chunk_alloc();
|
2017-06-20 07:28:32 +08:00
|
|
|
trace_percpu_create_chunk(base_addr);
|
2017-06-20 07:28:31 +08:00
|
|
|
|
2009-02-20 15:29:08 +08:00
|
|
|
/* we're done */
|
2009-08-14 14:00:51 +08:00
|
|
|
pcpu_base_addr = base_addr;
|
2009-02-20 15:29:08 +08:00
|
|
|
}
|
2009-03-10 15:27:48 +08:00
|
|
|
|
2010-09-04 00:22:48 +08:00
|
|
|
#ifdef CONFIG_SMP
|
|
|
|
|
2012-10-05 08:12:07 +08:00
|
|
|
const char * const pcpu_fc_names[PCPU_FC_NR] __initconst = {
|
2009-08-14 14:00:50 +08:00
|
|
|
[PCPU_FC_AUTO] = "auto",
|
|
|
|
[PCPU_FC_EMBED] = "embed",
|
|
|
|
[PCPU_FC_PAGE] = "page",
|
|
|
|
};
|
2009-03-10 15:27:48 +08:00
|
|
|
|
2009-08-14 14:00:50 +08:00
|
|
|
enum pcpu_fc pcpu_chosen_fc __initdata = PCPU_FC_AUTO;
|
2009-03-10 15:27:48 +08:00
|
|
|
|
2009-08-14 14:00:50 +08:00
|
|
|
static int __init percpu_alloc_setup(char *str)
|
|
|
|
{
|
2012-11-25 05:17:13 +08:00
|
|
|
if (!str)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2009-08-14 14:00:50 +08:00
|
|
|
if (0)
|
|
|
|
/* nada */;
|
|
|
|
#ifdef CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK
|
|
|
|
else if (!strcmp(str, "embed"))
|
|
|
|
pcpu_chosen_fc = PCPU_FC_EMBED;
|
|
|
|
#endif
|
|
|
|
#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
|
|
|
|
else if (!strcmp(str, "page"))
|
|
|
|
pcpu_chosen_fc = PCPU_FC_PAGE;
|
|
|
|
#endif
|
|
|
|
else
|
2016-03-18 05:19:53 +08:00
|
|
|
pr_warn("unknown allocator %s specified\n", str);
|
2009-03-10 15:27:48 +08:00
|
|
|
|
2009-08-14 14:00:50 +08:00
|
|
|
return 0;
|
2009-03-10 15:27:48 +08:00
|
|
|
}
|
2009-08-14 14:00:50 +08:00
|
|
|
early_param("percpu_alloc", percpu_alloc_setup);
|
2009-03-10 15:27:48 +08:00
|
|
|
|
2010-09-10 00:00:15 +08:00
|
|
|
/*
|
|
|
|
* pcpu_embed_first_chunk() is used by the generic percpu setup.
|
|
|
|
* Build it if needed by the arch config or the generic setup is going
|
|
|
|
* to be used.
|
|
|
|
*/
|
2009-08-14 14:00:49 +08:00
|
|
|
#if defined(CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK) || \
|
|
|
|
!defined(CONFIG_HAVE_SETUP_PER_CPU_AREA)
|
2010-09-10 00:00:15 +08:00
|
|
|
#define BUILD_EMBED_FIRST_CHUNK
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/* build pcpu_page_first_chunk() iff needed by the arch config */
|
|
|
|
#if defined(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK)
|
|
|
|
#define BUILD_PAGE_FIRST_CHUNK
|
|
|
|
#endif
|
|
|
|
|
|
|
|
/* pcpu_build_alloc_info() is used by both embed and page first chunk */
|
|
|
|
#if defined(BUILD_EMBED_FIRST_CHUNK) || defined(BUILD_PAGE_FIRST_CHUNK)
|
|
|
|
/**
|
|
|
|
* pcpu_build_alloc_info - build alloc_info considering distances between CPUs
|
|
|
|
* @reserved_size: the size of reserved percpu area in bytes
|
|
|
|
* @dyn_size: minimum free size for dynamic allocation in bytes
|
|
|
|
* @atom_size: allocation atom size
|
|
|
|
* @cpu_distance_fn: callback to determine distance between cpus, optional
|
|
|
|
*
|
|
|
|
* This function determines grouping of units, their mappings to cpus
|
|
|
|
* and other parameters considering needed percpu size, allocation
|
|
|
|
* atom size and distances between CPUs.
|
|
|
|
*
|
2015-03-07 06:30:42 +08:00
|
|
|
* Groups are always multiples of atom size and CPUs which are of
|
2010-09-10 00:00:15 +08:00
|
|
|
* LOCAL_DISTANCE both ways are grouped together and share space for
|
|
|
|
* units in the same group. The returned configuration is guaranteed
|
|
|
|
* to have CPUs on different nodes on different groups and >=75% usage
|
|
|
|
* of allocated virtual address space.
|
|
|
|
*
|
|
|
|
* RETURNS:
|
|
|
|
* On success, pointer to the new allocation_info is returned. On
|
|
|
|
* failure, ERR_PTR value is returned.
|
|
|
|
*/
|
2021-02-15 01:16:33 +08:00
|
|
|
static struct pcpu_alloc_info * __init __flatten pcpu_build_alloc_info(
|
2010-09-10 00:00:15 +08:00
|
|
|
size_t reserved_size, size_t dyn_size,
|
|
|
|
size_t atom_size,
|
|
|
|
pcpu_fc_cpu_distance_fn_t cpu_distance_fn)
|
|
|
|
{
|
|
|
|
static int group_map[NR_CPUS] __initdata;
|
|
|
|
static int group_cnt[NR_CPUS] __initdata;
|
2020-10-30 09:38:20 +08:00
|
|
|
static struct cpumask mask __initdata;
|
2010-09-10 00:00:15 +08:00
|
|
|
const size_t static_size = __per_cpu_end - __per_cpu_start;
|
|
|
|
int nr_groups = 1, nr_units = 0;
|
|
|
|
size_t size_sum, min_unit_size, alloc_size;
|
treewide: Remove uninitialized_var() usage
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.
In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:
git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
xargs perl -pi -e \
's/\buninitialized_var\(([^\)]+)\)/\1/g;
s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'
drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.
No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.
[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/
Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-06-04 04:09:38 +08:00
|
|
|
int upa, max_upa, best_upa; /* units_per_alloc */
|
2010-09-10 00:00:15 +08:00
|
|
|
int last_allocs, group, unit;
|
|
|
|
unsigned int cpu, tcpu;
|
|
|
|
struct pcpu_alloc_info *ai;
|
|
|
|
unsigned int *cpu_map;
|
|
|
|
|
|
|
|
/* this function may be called multiple times */
|
|
|
|
memset(group_map, 0, sizeof(group_map));
|
|
|
|
memset(group_cnt, 0, sizeof(group_cnt));
|
2020-10-30 09:38:20 +08:00
|
|
|
cpumask_clear(&mask);
|
2010-09-10 00:00:15 +08:00
|
|
|
|
|
|
|
/* calculate size_sum and ensure dyn_size is enough for early alloc */
|
|
|
|
size_sum = PFN_ALIGN(static_size + reserved_size +
|
|
|
|
max_t(size_t, dyn_size, PERCPU_DYNAMIC_EARLY_SIZE));
|
|
|
|
dyn_size = size_sum - static_size - reserved_size;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Determine min_unit_size, alloc_size and max_upa such that
|
|
|
|
* alloc_size is multiple of atom_size and is the smallest
|
2011-03-31 09:57:33 +08:00
|
|
|
* which can accommodate 4k aligned segments which are equal to
|
2010-09-10 00:00:15 +08:00
|
|
|
* or larger than min_unit_size.
|
|
|
|
*/
|
|
|
|
min_unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE);
|
|
|
|
|
2017-07-16 10:23:09 +08:00
|
|
|
/* determine the maximum # of units that can fit in an allocation */
|
2010-09-10 00:00:15 +08:00
|
|
|
alloc_size = roundup(min_unit_size, atom_size);
|
|
|
|
upa = alloc_size / min_unit_size;
|
2015-11-06 10:46:43 +08:00
|
|
|
while (alloc_size % upa || (offset_in_page(alloc_size / upa)))
|
2010-09-10 00:00:15 +08:00
|
|
|
upa--;
|
|
|
|
max_upa = upa;
|
|
|
|
|
2020-10-30 09:38:20 +08:00
|
|
|
cpumask_copy(&mask, cpu_possible_mask);
|
|
|
|
|
2010-09-10 00:00:15 +08:00
|
|
|
/* group cpus according to their proximity */
|
2020-10-30 09:38:20 +08:00
|
|
|
for (group = 0; !cpumask_empty(&mask); group++) {
|
|
|
|
/* pop the group's first cpu */
|
|
|
|
cpu = cpumask_first(&mask);
|
2010-09-10 00:00:15 +08:00
|
|
|
group_map[cpu] = group;
|
|
|
|
group_cnt[group]++;
|
2020-10-30 09:38:20 +08:00
|
|
|
cpumask_clear_cpu(cpu, &mask);
|
|
|
|
|
|
|
|
for_each_cpu(tcpu, &mask) {
|
|
|
|
if (!cpu_distance_fn ||
|
|
|
|
(cpu_distance_fn(cpu, tcpu) == LOCAL_DISTANCE &&
|
|
|
|
cpu_distance_fn(tcpu, cpu) == LOCAL_DISTANCE)) {
|
|
|
|
group_map[tcpu] = group;
|
|
|
|
group_cnt[group]++;
|
|
|
|
cpumask_clear_cpu(tcpu, &mask);
|
|
|
|
}
|
|
|
|
}
|
2010-09-10 00:00:15 +08:00
|
|
|
}
|
2020-10-30 09:38:20 +08:00
|
|
|
nr_groups = group;
|
2010-09-10 00:00:15 +08:00
|
|
|
|
|
|
|
/*
|
2017-07-16 10:23:09 +08:00
|
|
|
* Wasted space is caused by a ratio imbalance of upa to group_cnt.
|
|
|
|
* Expand the unit_size until we use >= 75% of the units allocated.
|
|
|
|
* Related to atom_size, which could be much larger than the unit_size.
|
2010-09-10 00:00:15 +08:00
|
|
|
*/
|
|
|
|
last_allocs = INT_MAX;
|
2021-06-14 22:42:05 +08:00
|
|
|
best_upa = 0;
|
2010-09-10 00:00:15 +08:00
|
|
|
for (upa = max_upa; upa; upa--) {
|
|
|
|
int allocs = 0, wasted = 0;
|
|
|
|
|
2015-11-06 10:46:43 +08:00
|
|
|
if (alloc_size % upa || (offset_in_page(alloc_size / upa)))
|
2010-09-10 00:00:15 +08:00
|
|
|
continue;
|
|
|
|
|
|
|
|
for (group = 0; group < nr_groups; group++) {
|
|
|
|
int this_allocs = DIV_ROUND_UP(group_cnt[group], upa);
|
|
|
|
allocs += this_allocs;
|
|
|
|
wasted += this_allocs * upa - group_cnt[group];
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Don't accept if wastage is over 1/3. The
|
|
|
|
* greater-than comparison ensures upa==1 always
|
|
|
|
* passes the following check.
|
|
|
|
*/
|
|
|
|
if (wasted > num_possible_cpus() / 3)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* and then don't consume more memory */
|
|
|
|
if (allocs > last_allocs)
|
|
|
|
break;
|
|
|
|
last_allocs = allocs;
|
|
|
|
best_upa = upa;
|
|
|
|
}
|
2021-06-14 22:42:05 +08:00
|
|
|
BUG_ON(!best_upa);
|
2010-09-10 00:00:15 +08:00
|
|
|
upa = best_upa;
|
|
|
|
|
|
|
|
/* allocate and fill alloc_info */
|
|
|
|
for (group = 0; group < nr_groups; group++)
|
|
|
|
nr_units += roundup(group_cnt[group], upa);
|
|
|
|
|
|
|
|
ai = pcpu_alloc_alloc_info(nr_groups, nr_units);
|
|
|
|
if (!ai)
|
|
|
|
return ERR_PTR(-ENOMEM);
|
|
|
|
cpu_map = ai->groups[0].cpu_map;
|
|
|
|
|
|
|
|
for (group = 0; group < nr_groups; group++) {
|
|
|
|
ai->groups[group].cpu_map = cpu_map;
|
|
|
|
cpu_map += roundup(group_cnt[group], upa);
|
|
|
|
}
|
|
|
|
|
|
|
|
ai->static_size = static_size;
|
|
|
|
ai->reserved_size = reserved_size;
|
|
|
|
ai->dyn_size = dyn_size;
|
|
|
|
ai->unit_size = alloc_size / upa;
|
|
|
|
ai->atom_size = atom_size;
|
|
|
|
ai->alloc_size = alloc_size;
|
|
|
|
|
2019-02-20 21:32:55 +08:00
|
|
|
for (group = 0, unit = 0; group < nr_groups; group++) {
|
2010-09-10 00:00:15 +08:00
|
|
|
struct pcpu_group_info *gi = &ai->groups[group];
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Initialize base_offset as if all groups are located
|
|
|
|
* back-to-back. The caller should update this to
|
|
|
|
* reflect actual allocation.
|
|
|
|
*/
|
|
|
|
gi->base_offset = unit * ai->unit_size;
|
|
|
|
|
|
|
|
for_each_possible_cpu(cpu)
|
|
|
|
if (group_map[cpu] == group)
|
|
|
|
gi->cpu_map[gi->nr_units++] = cpu;
|
|
|
|
gi->nr_units = roundup(gi->nr_units, upa);
|
|
|
|
unit += gi->nr_units;
|
|
|
|
}
|
|
|
|
BUG_ON(unit != nr_units);
|
|
|
|
|
|
|
|
return ai;
|
|
|
|
}
|
|
|
|
#endif /* BUILD_EMBED_FIRST_CHUNK || BUILD_PAGE_FIRST_CHUNK */
|
|
|
|
|
|
|
|
#if defined(BUILD_EMBED_FIRST_CHUNK)
|
2009-03-10 15:27:48 +08:00
|
|
|
/**
|
|
|
|
* pcpu_embed_first_chunk - embed the first percpu chunk into bootmem
|
|
|
|
* @reserved_size: the size of reserved percpu area in bytes
|
2010-06-28 00:49:59 +08:00
|
|
|
* @dyn_size: minimum free size for dynamic allocation in bytes
|
2009-08-14 14:00:52 +08:00
|
|
|
* @atom_size: allocation atom size
|
|
|
|
* @cpu_distance_fn: callback to determine distance between cpus, optional
|
|
|
|
* @alloc_fn: function to allocate percpu page
|
2011-03-31 09:57:33 +08:00
|
|
|
* @free_fn: function to free percpu page
|
2009-03-10 15:27:48 +08:00
|
|
|
*
|
|
|
|
* This is a helper to ease setting up embedded first percpu chunk and
|
|
|
|
* can be called where pcpu_setup_first_chunk() is expected.
|
|
|
|
*
|
|
|
|
* If this function is used to setup the first chunk, it is allocated
|
2009-08-14 14:00:52 +08:00
|
|
|
* by calling @alloc_fn and used as-is without being mapped into
|
|
|
|
* vmalloc area. Allocations are always whole multiples of @atom_size
|
|
|
|
* aligned to @atom_size.
|
|
|
|
*
|
|
|
|
* This enables the first chunk to piggy back on the linear physical
|
|
|
|
* mapping which often uses larger page size. Please note that this
|
|
|
|
* can result in very sparse cpu->unit mapping on NUMA machines thus
|
|
|
|
* requiring large vmalloc address space. Don't use this allocator if
|
|
|
|
* vmalloc space is not orders of magnitude larger than distances
|
|
|
|
* between node memory addresses (ie. 32bit NUMA machines).
|
2009-03-10 15:27:48 +08:00
|
|
|
*
|
2010-06-28 00:49:59 +08:00
|
|
|
* @dyn_size specifies the minimum dynamic area size.
|
2009-03-10 15:27:48 +08:00
|
|
|
*
|
|
|
|
* If the needed size is smaller than the minimum or specified unit
|
2009-08-14 14:00:52 +08:00
|
|
|
* size, the leftover is returned using @free_fn.
|
2009-03-10 15:27:48 +08:00
|
|
|
*
|
|
|
|
* RETURNS:
|
2009-08-14 14:00:51 +08:00
|
|
|
* 0 on success, -errno on failure.
|
2009-03-10 15:27:48 +08:00
|
|
|
*/
|
2010-06-28 00:49:59 +08:00
|
|
|
int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
|
2009-08-14 14:00:52 +08:00
|
|
|
size_t atom_size,
|
|
|
|
pcpu_fc_cpu_distance_fn_t cpu_distance_fn,
|
|
|
|
pcpu_fc_alloc_fn_t alloc_fn,
|
|
|
|
pcpu_fc_free_fn_t free_fn)
|
2009-03-10 15:27:48 +08:00
|
|
|
{
|
2009-08-14 14:00:52 +08:00
|
|
|
void *base = (void *)ULONG_MAX;
|
|
|
|
void **areas = NULL;
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
struct pcpu_alloc_info *ai;
|
2016-10-05 21:19:11 +08:00
|
|
|
size_t size_sum, areas_size;
|
|
|
|
unsigned long max_distance;
|
2019-07-03 16:25:52 +08:00
|
|
|
int group, i, highest_group, rc = 0;
|
2009-03-10 15:27:48 +08:00
|
|
|
|
2009-08-14 14:00:52 +08:00
|
|
|
ai = pcpu_build_alloc_info(reserved_size, dyn_size, atom_size,
|
|
|
|
cpu_distance_fn);
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
if (IS_ERR(ai))
|
|
|
|
return PTR_ERR(ai);
|
2009-03-10 15:27:48 +08:00
|
|
|
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
size_sum = ai->static_size + ai->reserved_size + ai->dyn_size;
|
2009-08-14 14:00:52 +08:00
|
|
|
areas_size = PFN_ALIGN(ai->nr_groups * sizeof(void *));
|
2009-06-22 10:56:24 +08:00
|
|
|
|
2019-03-12 14:30:42 +08:00
|
|
|
areas = memblock_alloc(areas_size, SMP_CACHE_BYTES);
|
2009-08-14 14:00:52 +08:00
|
|
|
if (!areas) {
|
2009-08-14 14:00:51 +08:00
|
|
|
rc = -ENOMEM;
|
2009-08-14 14:00:52 +08:00
|
|
|
goto out_free;
|
2009-06-22 10:56:24 +08:00
|
|
|
}
|
2009-03-10 15:27:48 +08:00
|
|
|
|
mm/percpu.c: fix potential memory leakage for pcpu_embed_first_chunk()
in order to ensure the percpu group areas within a chunk aren't
distributed too sparsely, pcpu_embed_first_chunk() goes to error handling
path when a chunk spans over 3/4 VMALLOC area, however, during the error
handling, it forget to free the memory allocated for all percpu groups by
going to label @out_free other than @out_free_areas.
it will cause memory leakage issue if the rare scene really happens, in
order to fix the issue, we check chunk spanned area immediately after
completing memory allocation for all percpu groups, we go to label
@out_free_areas to free the memory then return if the checking is failed.
in order to verify the approach, we dump all memory allocated then
enforce the jump then dump all memory freed, the result is okay after
checking whether we free all memory we allocate in this function.
BTW, The approach is chosen after thinking over the below scenes
- we don't go to label @out_free directly to fix this issue since we
maybe free several allocated memory blocks twice
- the aim of jumping after pcpu_setup_first_chunk() is bypassing free
usable memory other than handling error, moreover, the function does
not return error code in any case, it either panics due to BUG_ON()
or return 0.
Signed-off-by: zijun_hu <zijun_hu@htc.com>
Tested-by: zijun_hu <zijun_hu@htc.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-10-05 21:30:24 +08:00
|
|
|
/* allocate, copy and determine base address & max_distance */
|
|
|
|
highest_group = 0;
|
2009-08-14 14:00:52 +08:00
|
|
|
for (group = 0; group < ai->nr_groups; group++) {
|
|
|
|
struct pcpu_group_info *gi = &ai->groups[group];
|
|
|
|
unsigned int cpu = NR_CPUS;
|
|
|
|
void *ptr;
|
|
|
|
|
|
|
|
for (i = 0; i < gi->nr_units && cpu == NR_CPUS; i++)
|
|
|
|
cpu = gi->cpu_map[i];
|
|
|
|
BUG_ON(cpu == NR_CPUS);
|
|
|
|
|
|
|
|
/* allocate space for the whole group */
|
|
|
|
ptr = alloc_fn(cpu, gi->nr_units * ai->unit_size, atom_size);
|
|
|
|
if (!ptr) {
|
|
|
|
rc = -ENOMEM;
|
|
|
|
goto out_free_areas;
|
|
|
|
}
|
2011-09-27 00:12:53 +08:00
|
|
|
/* kmemleak tracks the percpu allocations separately */
|
|
|
|
kmemleak_free(ptr);
|
2009-08-14 14:00:52 +08:00
|
|
|
areas[group] = ptr;
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
|
2009-08-14 14:00:52 +08:00
|
|
|
base = min(ptr, base);
|
mm/percpu.c: fix potential memory leakage for pcpu_embed_first_chunk()
in order to ensure the percpu group areas within a chunk aren't
distributed too sparsely, pcpu_embed_first_chunk() goes to error handling
path when a chunk spans over 3/4 VMALLOC area, however, during the error
handling, it forget to free the memory allocated for all percpu groups by
going to label @out_free other than @out_free_areas.
it will cause memory leakage issue if the rare scene really happens, in
order to fix the issue, we check chunk spanned area immediately after
completing memory allocation for all percpu groups, we go to label
@out_free_areas to free the memory then return if the checking is failed.
in order to verify the approach, we dump all memory allocated then
enforce the jump then dump all memory freed, the result is okay after
checking whether we free all memory we allocate in this function.
BTW, The approach is chosen after thinking over the below scenes
- we don't go to label @out_free directly to fix this issue since we
maybe free several allocated memory blocks twice
- the aim of jumping after pcpu_setup_first_chunk() is bypassing free
usable memory other than handling error, moreover, the function does
not return error code in any case, it either panics due to BUG_ON()
or return 0.
Signed-off-by: zijun_hu <zijun_hu@htc.com>
Tested-by: zijun_hu <zijun_hu@htc.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-10-05 21:30:24 +08:00
|
|
|
if (ptr > areas[highest_group])
|
|
|
|
highest_group = group;
|
|
|
|
}
|
|
|
|
max_distance = areas[highest_group] - base;
|
|
|
|
max_distance += ai->unit_size * ai->groups[highest_group].nr_units;
|
|
|
|
|
|
|
|
/* warn if maximum distance is further than 75% of vmalloc space */
|
|
|
|
if (max_distance > VMALLOC_TOTAL * 3 / 4) {
|
|
|
|
pr_warn("max_distance=0x%lx too large for vmalloc space 0x%lx\n",
|
|
|
|
max_distance, VMALLOC_TOTAL);
|
|
|
|
#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
|
|
|
|
/* and fail if we have fallback */
|
|
|
|
rc = -EINVAL;
|
|
|
|
goto out_free_areas;
|
|
|
|
#endif
|
2012-04-27 23:42:53 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Copy data and free unused parts. This should happen after all
|
|
|
|
* allocations are complete; otherwise, we may end up with
|
|
|
|
* overlapping groups.
|
|
|
|
*/
|
|
|
|
for (group = 0; group < ai->nr_groups; group++) {
|
|
|
|
struct pcpu_group_info *gi = &ai->groups[group];
|
|
|
|
void *ptr = areas[group];
|
2009-08-14 14:00:52 +08:00
|
|
|
|
|
|
|
for (i = 0; i < gi->nr_units; i++, ptr += ai->unit_size) {
|
|
|
|
if (gi->cpu_map[i] == NR_CPUS) {
|
|
|
|
/* unused unit, free whole */
|
|
|
|
free_fn(ptr, ai->unit_size);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
/* copy and return the unused part */
|
|
|
|
memcpy(ptr, __per_cpu_load, ai->static_size);
|
|
|
|
free_fn(ptr + size_sum, ai->unit_size - size_sum);
|
|
|
|
}
|
2009-06-22 10:56:24 +08:00
|
|
|
}
|
2009-03-10 15:27:48 +08:00
|
|
|
|
2009-08-14 14:00:52 +08:00
|
|
|
/* base address is now known, determine group base offsets */
|
2009-09-24 17:46:01 +08:00
|
|
|
for (group = 0; group < ai->nr_groups; group++) {
|
2009-08-14 14:00:52 +08:00
|
|
|
ai->groups[group].base_offset = areas[group] - base;
|
2009-09-24 17:46:01 +08:00
|
|
|
}
|
2009-08-14 14:00:52 +08:00
|
|
|
|
2019-03-18 09:32:36 +08:00
|
|
|
pr_info("Embedded %zu pages/cpu s%zu r%zu d%zu u%zu\n",
|
|
|
|
PFN_DOWN(size_sum), ai->static_size, ai->reserved_size,
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
ai->dyn_size, ai->unit_size);
|
2009-07-04 07:10:59 +08:00
|
|
|
|
2019-07-03 16:25:52 +08:00
|
|
|
pcpu_setup_first_chunk(ai, base);
|
2009-08-14 14:00:52 +08:00
|
|
|
goto out_free;
|
|
|
|
|
|
|
|
out_free_areas:
|
|
|
|
for (group = 0; group < ai->nr_groups; group++)
|
2013-09-17 22:57:34 +08:00
|
|
|
if (areas[group])
|
|
|
|
free_fn(areas[group],
|
|
|
|
ai->groups[group].nr_units * ai->unit_size);
|
2009-08-14 14:00:52 +08:00
|
|
|
out_free:
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
pcpu_free_alloc_info(ai);
|
2009-08-14 14:00:52 +08:00
|
|
|
if (areas)
|
2021-11-06 04:43:22 +08:00
|
|
|
memblock_free(areas, areas_size);
|
2009-08-14 14:00:51 +08:00
|
|
|
return rc;
|
2009-07-04 07:10:59 +08:00
|
|
|
}
|
2010-09-10 00:00:15 +08:00
|
|
|
#endif /* BUILD_EMBED_FIRST_CHUNK */
|
2009-07-04 07:10:59 +08:00
|
|
|
|
2010-09-10 00:00:15 +08:00
|
|
|
#ifdef BUILD_PAGE_FIRST_CHUNK
|
2009-07-04 07:10:59 +08:00
|
|
|
/**
|
2009-08-14 14:00:49 +08:00
|
|
|
* pcpu_page_first_chunk - map the first chunk using PAGE_SIZE pages
|
2009-07-04 07:10:59 +08:00
|
|
|
* @reserved_size: the size of reserved percpu area in bytes
|
|
|
|
* @alloc_fn: function to allocate percpu page, always called with PAGE_SIZE
|
2011-03-31 09:57:33 +08:00
|
|
|
* @free_fn: function to free percpu page, always called with PAGE_SIZE
|
2009-07-04 07:10:59 +08:00
|
|
|
* @populate_pte_fn: function to populate pte
|
|
|
|
*
|
2009-08-14 14:00:49 +08:00
|
|
|
* This is a helper to ease setting up page-remapped first percpu
|
|
|
|
* chunk and can be called where pcpu_setup_first_chunk() is expected.
|
2009-07-04 07:10:59 +08:00
|
|
|
*
|
|
|
|
* This is the basic allocator. Static percpu area is allocated
|
|
|
|
* page-by-page into vmalloc area.
|
|
|
|
*
|
|
|
|
* RETURNS:
|
2009-08-14 14:00:51 +08:00
|
|
|
* 0 on success, -errno on failure.
|
2009-07-04 07:10:59 +08:00
|
|
|
*/
|
2009-08-14 14:00:51 +08:00
|
|
|
int __init pcpu_page_first_chunk(size_t reserved_size,
|
|
|
|
pcpu_fc_alloc_fn_t alloc_fn,
|
|
|
|
pcpu_fc_free_fn_t free_fn,
|
|
|
|
pcpu_fc_populate_pte_fn_t populate_pte_fn)
|
2009-07-04 07:10:59 +08:00
|
|
|
{
|
2009-07-04 07:10:59 +08:00
|
|
|
static struct vm_struct vm;
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
struct pcpu_alloc_info *ai;
|
2009-08-14 14:00:49 +08:00
|
|
|
char psize_str[16];
|
2009-07-04 07:11:00 +08:00
|
|
|
int unit_pages;
|
2009-07-04 07:10:59 +08:00
|
|
|
size_t pages_size;
|
2009-07-04 07:11:00 +08:00
|
|
|
struct page **pages;
|
2019-07-03 16:25:52 +08:00
|
|
|
int unit, i, j, rc = 0;
|
2016-12-13 08:45:02 +08:00
|
|
|
int upa;
|
|
|
|
int nr_g0_units;
|
2009-07-04 07:10:59 +08:00
|
|
|
|
2009-08-14 14:00:49 +08:00
|
|
|
snprintf(psize_str, sizeof(psize_str), "%luK", PAGE_SIZE >> 10);
|
|
|
|
|
2010-06-28 00:49:59 +08:00
|
|
|
ai = pcpu_build_alloc_info(reserved_size, 0, PAGE_SIZE, NULL);
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
if (IS_ERR(ai))
|
|
|
|
return PTR_ERR(ai);
|
|
|
|
BUG_ON(ai->nr_groups != 1);
|
2016-12-13 08:45:02 +08:00
|
|
|
upa = ai->alloc_size/ai->unit_size;
|
|
|
|
nr_g0_units = roundup(num_possible_cpus(), upa);
|
2018-09-01 03:44:22 +08:00
|
|
|
if (WARN_ON(ai->groups[0].nr_units != nr_g0_units)) {
|
2016-12-13 08:45:02 +08:00
|
|
|
pcpu_free_alloc_info(ai);
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
|
|
|
|
unit_pages = ai->unit_size >> PAGE_SHIFT;
|
2009-07-04 07:10:59 +08:00
|
|
|
|
|
|
|
/* unaligned allocations can't be freed, round up to page size */
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
pages_size = PFN_ALIGN(unit_pages * num_possible_cpus() *
|
|
|
|
sizeof(pages[0]));
|
memblock: stop using implicit alignment to SMP_CACHE_BYTES
When a memblock allocation APIs are called with align = 0, the alignment
is implicitly set to SMP_CACHE_BYTES.
Implicit alignment is done deep in the memblock allocator and it can
come as a surprise. Not that such an alignment would be wrong even
when used incorrectly but it is better to be explicit for the sake of
clarity and the prinicple of the least surprise.
Replace all such uses of memblock APIs with the 'align' parameter
explicitly set to SMP_CACHE_BYTES and stop implicit alignment assignment
in the memblock internal allocation functions.
For the case when memblock APIs are used via helper functions, e.g. like
iommu_arena_new_node() in Alpha, the helper functions were detected with
Coccinelle's help and then manually examined and updated where
appropriate.
The direct memblock APIs users were updated using the semantic patch below:
@@
expression size, min_addr, max_addr, nid;
@@
(
|
- memblock_alloc_try_nid_raw(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid_raw(size, SMP_CACHE_BYTES, min_addr, max_addr,
nid)
|
- memblock_alloc_try_nid_nopanic(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid_nopanic(size, SMP_CACHE_BYTES, min_addr, max_addr,
nid)
|
- memblock_alloc_try_nid(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid(size, SMP_CACHE_BYTES, min_addr, max_addr, nid)
|
- memblock_alloc(size, 0)
+ memblock_alloc(size, SMP_CACHE_BYTES)
|
- memblock_alloc_raw(size, 0)
+ memblock_alloc_raw(size, SMP_CACHE_BYTES)
|
- memblock_alloc_from(size, 0, min_addr)
+ memblock_alloc_from(size, SMP_CACHE_BYTES, min_addr)
|
- memblock_alloc_nopanic(size, 0)
+ memblock_alloc_nopanic(size, SMP_CACHE_BYTES)
|
- memblock_alloc_low(size, 0)
+ memblock_alloc_low(size, SMP_CACHE_BYTES)
|
- memblock_alloc_low_nopanic(size, 0)
+ memblock_alloc_low_nopanic(size, SMP_CACHE_BYTES)
|
- memblock_alloc_from_nopanic(size, 0, min_addr)
+ memblock_alloc_from_nopanic(size, SMP_CACHE_BYTES, min_addr)
|
- memblock_alloc_node(size, 0, nid)
+ memblock_alloc_node(size, SMP_CACHE_BYTES, nid)
)
[mhocko@suse.com: changelog update]
[akpm@linux-foundation.org: coding-style fixes]
[rppt@linux.ibm.com: fix missed uses of implicit alignment]
Link: http://lkml.kernel.org/r/20181016133656.GA10925@rapoport-lnx
Link: http://lkml.kernel.org/r/1538687224-17535-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Paul Burton <paul.burton@mips.com> [MIPS]
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31 06:09:57 +08:00
|
|
|
pages = memblock_alloc(pages_size, SMP_CACHE_BYTES);
|
2019-03-12 14:30:15 +08:00
|
|
|
if (!pages)
|
|
|
|
panic("%s: Failed to allocate %zu bytes\n", __func__,
|
|
|
|
pages_size);
|
2009-07-04 07:10:59 +08:00
|
|
|
|
2009-07-04 07:10:59 +08:00
|
|
|
/* allocate pages */
|
2009-07-04 07:10:59 +08:00
|
|
|
j = 0;
|
2016-12-13 08:45:02 +08:00
|
|
|
for (unit = 0; unit < num_possible_cpus(); unit++) {
|
|
|
|
unsigned int cpu = ai->groups[0].cpu_map[unit];
|
2009-07-04 07:11:00 +08:00
|
|
|
for (i = 0; i < unit_pages; i++) {
|
2009-07-04 07:10:59 +08:00
|
|
|
void *ptr;
|
|
|
|
|
2009-08-14 14:00:50 +08:00
|
|
|
ptr = alloc_fn(cpu, PAGE_SIZE, PAGE_SIZE);
|
2009-07-04 07:10:59 +08:00
|
|
|
if (!ptr) {
|
2016-03-18 05:19:53 +08:00
|
|
|
pr_warn("failed to allocate %s page for cpu%u\n",
|
2016-12-13 08:45:02 +08:00
|
|
|
psize_str, cpu);
|
2009-07-04 07:10:59 +08:00
|
|
|
goto enomem;
|
|
|
|
}
|
2011-09-27 00:12:53 +08:00
|
|
|
/* kmemleak tracks the percpu allocations separately */
|
|
|
|
kmemleak_free(ptr);
|
2009-07-04 07:11:00 +08:00
|
|
|
pages[j++] = virt_to_page(ptr);
|
2009-07-04 07:10:59 +08:00
|
|
|
}
|
2016-12-13 08:45:02 +08:00
|
|
|
}
|
2009-07-04 07:10:59 +08:00
|
|
|
|
2009-07-04 07:10:59 +08:00
|
|
|
/* allocate vm area, map the pages and copy static data */
|
|
|
|
vm.flags = VM_ALLOC;
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
vm.size = num_possible_cpus() * ai->unit_size;
|
2009-07-04 07:10:59 +08:00
|
|
|
vm_area_register_early(&vm, PAGE_SIZE);
|
|
|
|
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
for (unit = 0; unit < num_possible_cpus(); unit++) {
|
2009-08-14 14:00:50 +08:00
|
|
|
unsigned long unit_addr =
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
(unsigned long)vm.addr + unit * ai->unit_size;
|
2009-07-04 07:10:59 +08:00
|
|
|
|
2009-07-04 07:11:00 +08:00
|
|
|
for (i = 0; i < unit_pages; i++)
|
2009-07-04 07:10:59 +08:00
|
|
|
populate_pte_fn(unit_addr + (i << PAGE_SHIFT));
|
|
|
|
|
|
|
|
/* pte already populated, the following shouldn't fail */
|
2009-08-14 14:00:51 +08:00
|
|
|
rc = __pcpu_map_pages(unit_addr, &pages[unit * unit_pages],
|
|
|
|
unit_pages);
|
|
|
|
if (rc < 0)
|
|
|
|
panic("failed to map percpu area, err=%d\n", rc);
|
2009-03-10 15:27:48 +08:00
|
|
|
|
2009-07-04 07:10:59 +08:00
|
|
|
/*
|
|
|
|
* FIXME: Archs with virtual cache should flush local
|
|
|
|
* cache for the linear mapping here - something
|
|
|
|
* equivalent to flush_cache_vmap() on the local cpu.
|
|
|
|
* flush_cache_vmap() can't be used as most supporting
|
|
|
|
* data structures are not set up yet.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/* copy static data */
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
memcpy((void *)unit_addr, __per_cpu_load, ai->static_size);
|
2009-03-10 15:27:48 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* we're ready, commit */
|
2019-03-18 09:32:36 +08:00
|
|
|
pr_info("%d %s pages/cpu s%zu r%zu d%zu\n",
|
|
|
|
unit_pages, psize_str, ai->static_size,
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
ai->reserved_size, ai->dyn_size);
|
2009-07-04 07:10:59 +08:00
|
|
|
|
2019-07-03 16:25:52 +08:00
|
|
|
pcpu_setup_first_chunk(ai, vm.addr);
|
2009-07-04 07:10:59 +08:00
|
|
|
goto out_free_ar;
|
|
|
|
|
|
|
|
enomem:
|
|
|
|
while (--j >= 0)
|
2009-07-04 07:11:00 +08:00
|
|
|
free_fn(page_address(pages[j]), PAGE_SIZE);
|
2009-08-14 14:00:51 +08:00
|
|
|
rc = -ENOMEM;
|
2009-07-04 07:10:59 +08:00
|
|
|
out_free_ar:
|
2021-11-06 04:43:22 +08:00
|
|
|
memblock_free(pages, pages_size);
|
percpu: introduce pcpu_alloc_info and pcpu_group_info
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
2009-08-14 14:00:51 +08:00
|
|
|
pcpu_free_alloc_info(ai);
|
2009-08-14 14:00:51 +08:00
|
|
|
return rc;
|
2009-07-04 07:10:59 +08:00
|
|
|
}
|
2010-09-10 00:00:15 +08:00
|
|
|
#endif /* BUILD_PAGE_FIRST_CHUNK */
|
2009-07-04 07:10:59 +08:00
|
|
|
|
2010-09-04 00:22:48 +08:00
|
|
|
#ifndef CONFIG_HAVE_SETUP_PER_CPU_AREA
|
2009-03-30 18:07:44 +08:00
|
|
|
/*
|
2010-09-04 00:22:48 +08:00
|
|
|
* Generic SMP percpu area setup.
|
2009-03-30 18:07:44 +08:00
|
|
|
*
|
|
|
|
* The embedding helper is used because its behavior closely resembles
|
|
|
|
* the original non-dynamic generic percpu area setup. This is
|
|
|
|
* important because many archs have addressing restrictions and might
|
|
|
|
* fail if the percpu area is located far away from the previous
|
|
|
|
* location. As an added bonus, in non-NUMA cases, embedding is
|
|
|
|
* generally a good idea TLB-wise because percpu area can piggy back
|
|
|
|
* on the physical linear memory mapping which uses large page
|
|
|
|
* mappings on applicable archs.
|
|
|
|
*/
|
|
|
|
unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
|
|
|
|
EXPORT_SYMBOL(__per_cpu_offset);
|
|
|
|
|
2009-08-14 14:00:52 +08:00
|
|
|
static void * __init pcpu_dfl_fc_alloc(unsigned int cpu, size_t size,
|
|
|
|
size_t align)
|
|
|
|
{
|
2019-03-12 14:30:42 +08:00
|
|
|
return memblock_alloc_from(size, align, __pa(MAX_DMA_ADDRESS));
|
2009-08-14 14:00:52 +08:00
|
|
|
}
|
2009-03-10 15:27:48 +08:00
|
|
|
|
2009-08-14 14:00:52 +08:00
|
|
|
static void __init pcpu_dfl_fc_free(void *ptr, size_t size)
|
|
|
|
{
|
2021-11-06 04:43:22 +08:00
|
|
|
memblock_free(ptr, size);
|
2009-08-14 14:00:52 +08:00
|
|
|
}
|
|
|
|
|
2009-03-30 18:07:44 +08:00
|
|
|
void __init setup_per_cpu_areas(void)
|
|
|
|
{
|
|
|
|
unsigned long delta;
|
|
|
|
unsigned int cpu;
|
2009-08-14 14:00:51 +08:00
|
|
|
int rc;
|
2009-03-30 18:07:44 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Always reserve area for module percpu variables. That's
|
|
|
|
* what the legacy allocator did.
|
|
|
|
*/
|
2009-08-14 14:00:51 +08:00
|
|
|
rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
|
2009-08-14 14:00:52 +08:00
|
|
|
PERCPU_DYNAMIC_RESERVE, PAGE_SIZE, NULL,
|
|
|
|
pcpu_dfl_fc_alloc, pcpu_dfl_fc_free);
|
2009-08-14 14:00:51 +08:00
|
|
|
if (rc < 0)
|
2010-09-04 00:22:48 +08:00
|
|
|
panic("Failed to initialize percpu areas.");
|
2009-03-30 18:07:44 +08:00
|
|
|
|
|
|
|
delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
|
|
|
|
for_each_possible_cpu(cpu)
|
2009-08-14 14:00:51 +08:00
|
|
|
__per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];
|
2009-03-10 15:27:48 +08:00
|
|
|
}
|
2010-09-04 00:22:48 +08:00
|
|
|
#endif /* CONFIG_HAVE_SETUP_PER_CPU_AREA */
|
|
|
|
|
|
|
|
#else /* CONFIG_SMP */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* UP percpu area setup.
|
|
|
|
*
|
|
|
|
* UP always uses km-based percpu allocator with identity mapping.
|
|
|
|
* Static percpu variables are indistinguishable from the usual static
|
|
|
|
* variables and don't require any special preparation.
|
|
|
|
*/
|
|
|
|
void __init setup_per_cpu_areas(void)
|
|
|
|
{
|
|
|
|
const size_t unit_size =
|
|
|
|
roundup_pow_of_two(max_t(size_t, PCPU_MIN_UNIT_SIZE,
|
|
|
|
PERCPU_DYNAMIC_RESERVE));
|
|
|
|
struct pcpu_alloc_info *ai;
|
|
|
|
void *fc;
|
|
|
|
|
|
|
|
ai = pcpu_alloc_alloc_info(1, 1);
|
2019-03-12 14:30:42 +08:00
|
|
|
fc = memblock_alloc_from(unit_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
|
2010-09-04 00:22:48 +08:00
|
|
|
if (!ai || !fc)
|
|
|
|
panic("Failed to allocate memory for percpu areas.");
|
2012-05-09 23:55:19 +08:00
|
|
|
/* kmemleak tracks the percpu allocations separately */
|
|
|
|
kmemleak_free(fc);
|
2010-09-04 00:22:48 +08:00
|
|
|
|
|
|
|
ai->dyn_size = unit_size;
|
|
|
|
ai->unit_size = unit_size;
|
|
|
|
ai->atom_size = unit_size;
|
|
|
|
ai->alloc_size = unit_size;
|
|
|
|
ai->groups[0].nr_units = 1;
|
|
|
|
ai->groups[0].cpu_map[0] = 0;
|
|
|
|
|
2019-07-03 16:25:52 +08:00
|
|
|
pcpu_setup_first_chunk(ai, fc);
|
2017-10-04 06:29:49 +08:00
|
|
|
pcpu_free_alloc_info(ai);
|
2010-09-04 00:22:48 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
#endif /* CONFIG_SMP */
|
2010-06-28 00:50:00 +08:00
|
|
|
|
2018-08-22 12:53:58 +08:00
|
|
|
/*
|
|
|
|
* pcpu_nr_pages - calculate total number of populated backing pages
|
|
|
|
*
|
|
|
|
* This reflects the number of pages populated to back chunks. Metadata is
|
|
|
|
* excluded in the number exposed in meminfo as the number of backing pages
|
|
|
|
* scales with the number of cpus and can quickly outweigh the memory used for
|
|
|
|
* metadata. It also keeps this calculation nice and simple.
|
|
|
|
*
|
|
|
|
* RETURNS:
|
|
|
|
* Total number of populated backing pages in use by the allocator.
|
|
|
|
*/
|
|
|
|
unsigned long pcpu_nr_pages(void)
|
|
|
|
{
|
|
|
|
return pcpu_nr_populated * pcpu_nr_units;
|
|
|
|
}
|
|
|
|
|
2014-09-03 02:46:05 +08:00
|
|
|
/*
|
|
|
|
* Percpu allocator is initialized early during boot when neither slab or
|
|
|
|
* workqueue is available. Plug async management until everything is up
|
|
|
|
* and running.
|
|
|
|
*/
|
|
|
|
static int __init percpu_enable_async(void)
|
|
|
|
{
|
|
|
|
pcpu_async_enabled = true;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
subsys_initcall(percpu_enable_async);
|