The part of the sparsemem patch which modifies memmap_init_zone() has recently
become a problem. It changes behavior so that there is a call to
pfn_to_page() for each individual page inside of a node's range:
node_start_pfn through node_end_pfn. It used to simply do this once, at the
beginning of the node, but having sparsemem's non-contiguous mem_map[]s inside
of a node made it necessary to change.
Mike Kravetz recently wrote a patch which made the NUMA code accept some new
kinds of layouts. The system's memory was laid out like this, with node 0's
memory in two pieces: one before and one after node 1's memory:
Node 0: +++++ +++++
Node 1: +++++
Previous behavior before Mike's patch was to assign nodes like this:
Node 0: 00000 XXXXX
Node 1: 11111
Where the 'X' areas were simply thrown away. The new behavior was to make the
pg_data_t span node 0 across all of its areas, including areas that are really
node 1's: Node 0: 000000000000000 Node 1: 11111
This wastes a little bit of mem_map space, but ends up being OK, and more
fully utilizes the system's memory. memmap_init_zone() initializes all of the
"struct page"s for node 0, even for the "hole", but those never get used,
because there is no pfn_to_page() that resolves to those pages. However, only
calling pfn_to_page() once, memmap_init_zone() always uses the pages that were
allocated for node0->node_mem_map because:
struct page *start = pfn_to_page(start_pfn);
// effectively start = &node->node_mem_map[0]
for (page = start; page < (start + size); page++) {
init_page_here();...
page++;
}
Slow, and wasteful, but generally harmless.
But, modify that to call pfn_to_page() for each loop iteration (like sparsemem
does):
for (pfn = start_pfn; pfn < < (start_pfn + size); pfn++++) {
page = pfn_to_page(pfn);
}
And you end up trying to initialize node 1's pages too early, along with bogus
data from node 0. This patch checks for those weird layouts and declines to
touch the pages, making the more frequent pfn_to_page() calls OK to do.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Sparsemem abstracts the use of discontiguous mem_maps[]. This kind of
mem_map[] is needed by discontiguous memory machines (like in the old
CONFIG_DISCONTIGMEM case) as well as memory hotplug systems. Sparsemem
replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
become a complete replacement.
A significant advantage over DISCONTIGMEM is that it's completely separated
from CONFIG_NUMA. When producing this patch, it became apparent in that NUMA
and DISCONTIG are often confused.
Another advantage is that sparse doesn't require each NUMA node's ranges to be
contiguous. It can handle overlapping ranges between nodes with no problems,
where DISCONTIGMEM currently throws away that memory.
Sparsemem uses an array to provide different pfn_to_page() translations for
each SECTION_SIZE area of physical memory. This is what allows the mem_map[]
to be chopped up.
In order to do quick pfn_to_page() operations, the section number of the page
is encoded in page->flags. Part of the sparsemem infrastructure enables
sharing of these bits more dynamically (at compile-time) between the
page_zone() and sparsemem operations. However, on 32-bit architectures, the
number of bits is quite limited, and may require growing the size of the
page->flags type in certain conditions. Several things might force this to
occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
memory), an increase in the physical address space, or an increase in the
number of used page->flags.
One thing to note is that, once sparsemem is present, the NUMA node
information no longer needs to be stored in the page->flags. It might provide
speed increases on certain platforms and will be stored there if there is
room. But, if out of room, an alternate (theoretically slower) mechanism is
used.
This patch introduces CONFIG_FLATMEM. It is used in almost all cases where
there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
often have to compile out the same areas of code.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Martin Bligh <mbligh@aracnet.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There is some confusion that arose when working on SPARSEMEM patch between
what is needed for DISCONTIG vs. NUMA.
Multiple pg_data_t's are needed for DISCONTIGMEM or NUMA, independently.
All of the current NUMA implementations require an implementation of
DISCONTIG. Because of this, quite a lot of code which is really needed for
NUMA is actually under DISCONTIG #ifdefs. For SPARSEMEM, we changed some
of these #ifdefs to CONFIG_NUMA, but that broke the DISCONTIG=y and NUMA=n
case.
Introducing this new NEED_MULTIPLE_NODES config option allows code that is
needed for both NUMA or DISCONTIG to be separated out from code that is
specific to DISCONTIG.
One great advantage of this approach is that it doesn't require every
architecture to be converted over. All of the current implementations
should "just work", only the ones implementing SPARSEMEM will have to be
fixed up.
The change to free_area_init() makes it work inside, or out of the new
config option.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Generify the value fields in the page_flags. The aim is to allow the location
and size of these fields to be varied. Additionally we want to move away from
fixed allocations per field whilst still enforcing the overall bit utilisation
limits. We rely on the compiler to spot and optimise the accessor functions.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Introduce a simple allocator for the NUMA remap space. This space is very
scarce, used for structures which are best allocated node local.
This mechanism is also used on non-NUMA ia64 systems with a vmem_map to keep
the pgdat->node_mem_map initialized in a consistent place for all
architectures.
Issues:
o alloc_remap takes a node_id where we might expect a pgdat which was intended
to allow us to allocate the pgdat's using this mechanism; which we do not yet
do. Could have alloc_remap_node() and alloc_remap_nid() for this purpose.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The boot_pageset needs to be preserved for hotplugging and for off line
processors and nodes. Otherwise pointers will point into memory that has
now a different use. /proc/zoneinfo is currently showing strange results
if processors / nodes are not present.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Since free_pages_check complains if PG_reclaim or PG_slab is set, bad_page
ought to clear them to avoid repetitive reports (Nikita noticed this too).
Let prep_new_page check page_count and PG_slab as free_pages_check does.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Reduce size of the huge per_cpu_pageset structure in __initdata introduced
into mm1 with the pageset localization patchset. Use one specially
configured pageset per cpu for all zones and nodes during bootup.
- Avoid duplication of pageset initialization code.
- do the adding to the pageset list before potential free_pages_bulk
in free_hot_cold_page (otherwise we would have to hold a page
in a pageset during the period that the boot pagesets are in use).
- remove mistaken __cpuinitdata attribute and revert back to __initdata
for the boot pageset. A boot pageset is not necessary for cpu hotplug.
Tested for UP SMP NUMA on x86_64 (2.6.12-rc6-mm1): UP SMP NUMA Tested on
IA64 (2.6.12-rc5-mm2): NUMA (2.6.12-rc6-mm1 broken for IA64 because of
sparsemem patches)
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The pageset array can potentially acquire a huge amount of memory on large
NUMA systems. F.e. on a system with 512 processors and 256 nodes there
will be 256*512 pagesets. If each pageset only holds 5 pages then we are
talking about 655360 pages.With a 16K page size on IA64 this results in
potentially 10 Gigabytes of memory being trapped in pagesets. The typical
cases are much less for smaller systems but there is still the potential of
memory being trapped in off node pagesets. Off node memory may be rarely
used if local memory is available and so we may potentially have memory in
seldom used pagesets without this patch.
The slab allocator flushes its per cpu caches every 2 seconds. The
following patch flushes the off node pageset caches in the same way by
tying into the slab flush.
The patch also changes /proc/zoneinfo to include the number of pages
currently in each pageset.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch provides more debug info when the system is OOM. It displays
memory stats (basically sysrq-m info) from __alloc_pages() when page
allocation fails and during OOM kill.
Thanks to Dave Jones for coming up with the idea.
Signed-off-by: Janet Morgan <janetmor@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
By making the offset argument of __read_page_state an unsigned long instead of
unsigned, we can avoid forcing the compiler to sign extend a usually constant
argument. This saves 1 instruction on x86-64.
Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
By making the offset argument of __mod_page_state an unsigned long instead
of unsigned, we can avoid forcing the compiler to sign extend a usually
constant argument. This saves 1 instruction on x86-64.
Signed-off-by: Benjamin LaHaise <benjamin.c.lahaise@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
try_to_free_pages accepts a third argument, order, but hasn't used it since
before 2.6.0. The following patch removes the argument and updates all the
calls to try_to_free_pages.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch modifies the way pagesets in struct zone are managed.
Each zone has a per-cpu array of pagesets. So any particular CPU has some
memory in each zone structure which belongs to itself. Even if that CPU is
not local to that zone.
So the patch relocates the pagesets for each cpu to the node that is nearest
to the cpu instead of allocating the pagesets in the (possibly remote) target
zone. This means that the operations to manage pages on remote zone can be
done with information available locally.
We play a macro trick so that non-NUMA pmachines avoid the additional
pointer chase on the page allocator fastpath.
AIM7 benchmark on a 32 CPU SGI Altix
w/o patches:
Tasks jobs/min jti jobs/min/task real cpu
1 484.68 100 484.6769 12.01 1.97 Fri Mar 25 11:01:42 2005
100 27140.46 89 271.4046 21.44 148.71 Fri Mar 25 11:02:04 2005
200 30792.02 82 153.9601 37.80 296.72 Fri Mar 25 11:02:42 2005
300 32209.27 81 107.3642 54.21 451.34 Fri Mar 25 11:03:37 2005
400 34962.83 78 87.4071 66.59 588.97 Fri Mar 25 11:04:44 2005
500 31676.92 75 63.3538 91.87 742.71 Fri Mar 25 11:06:16 2005
600 36032.69 73 60.0545 96.91 885.44 Fri Mar 25 11:07:54 2005
700 35540.43 77 50.7720 114.63 1024.28 Fri Mar 25 11:09:49 2005
800 33906.70 74 42.3834 137.32 1181.65 Fri Mar 25 11:12:06 2005
900 34120.67 73 37.9119 153.51 1325.26 Fri Mar 25 11:14:41 2005
1000 34802.37 74 34.8024 167.23 1465.26 Fri Mar 25 11:17:28 2005
with slab API changes and pageset patch:
Tasks jobs/min jti jobs/min/task real cpu
1 485.00 100 485.0000 12.00 1.96 Fri Mar 25 11:46:18 2005
100 28000.96 89 280.0096 20.79 150.45 Fri Mar 25 11:46:39 2005
200 32285.80 79 161.4290 36.05 293.37 Fri Mar 25 11:47:16 2005
300 40424.15 84 134.7472 43.19 438.42 Fri Mar 25 11:47:59 2005
400 39155.01 79 97.8875 59.46 590.05 Fri Mar 25 11:48:59 2005
500 37881.25 82 75.7625 76.82 730.19 Fri Mar 25 11:50:16 2005
600 39083.14 78 65.1386 89.35 872.79 Fri Mar 25 11:51:46 2005
700 38627.83 77 55.1826 105.47 1022.46 Fri Mar 25 11:53:32 2005
800 39631.94 78 49.5399 117.48 1169.94 Fri Mar 25 11:55:30 2005
900 36903.70 79 41.0041 141.94 1310.78 Fri Mar 25 11:57:53 2005
1000 36201.23 77 36.2012 160.77 1458.31 Fri Mar 25 12:00:34 2005
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com>
Signed-off-by: Shai Fultheim <Shai@Scalex86.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When early zone reclaim is turned on the LRU is scanned more frequently when a
zone is low on memory. This limits when the zone reclaim can be called by
skipping the scan if another thread (either via kswapd or sync reclaim) is
already reclaiming from the zone.
Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When using the early zone reclaim, it was noticed that allocating new pages
that should be spread across the whole system caused eviction of local pages.
This adds a new GFP flag to prevent early reclaim from happening during
certain allocation attempts. The example that is implemented here is for page
cache pages. We want page cache pages to be spread across the whole system,
and we don't want page cache pages to evict other pages to get local memory.
Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is the core of the (much simplified) early reclaim. The goal of this
patch is to reclaim some easily-freed pages from a zone before falling back
onto another zone.
One of the major uses of this is NUMA machines. With the default allocator
behavior the allocator would look for memory in another zone, which might be
off-node, before trying to reclaim from the current zone.
This adds a zone tuneable to enable early zone reclaim. It is selected on a
per-zone basis and is turned on/off via syscall.
Adding some extra throttling on the reclaim was also required (patch
4/4). Without the machine would grind to a crawl when doing a "make -j"
kernel build. Even with this patch the System Time is higher on
average, but it seems tolerable. Here are some numbers for kernbench
runs on a 2-node, 4cpu, 8Gig RAM Altix in the "make -j" run:
wall user sys %cpu ctx sw. sleeps
---- ---- --- ---- ------ ------
No patch 1009 1384 847 258 298170 504402
w/patch, no reclaim 880 1376 667 288 254064 396745
w/patch & reclaim 1079 1385 926 252 291625 548873
These numbers are the average of 2 runs of 3 "make -j" runs done right
after system boot. Run-to-run variability for "make -j" is huge, so
these numbers aren't terribly useful except to seee that with reclaim
the benchmark still finishes in a reasonable amount of time.
I also looked at the NUMA hit/miss stats for the "make -j" runs and the
reclaim doesn't make any difference when the machine is thrashing away.
Doing a "make -j8" on a single node that is filled with page cache pages
takes 700 seconds with reclaim turned on and 735 seconds without reclaim
(due to remote memory accesses).
The simple zone_reclaim syscall program is at
http://www.bork.org/~mort/sgi/zone_reclaim.c
Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add /proc/zoneinfo file to display information about memory zones. Useful
to analyze VM behaviour.
Signed-off-by: Nikita Danilov <nikita@clusterfs.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Export node_online_map and node_possible_map so that kernel modules can use
the nodemask macros, like, for_each_node() and for_each_online_node().
Signed-off-by: Dean Nelson <dcn@sgi.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
I have recompiled Linux kernel 2.6.11.5 documentation for me and our
university students again. The documentation could be extended for more
sources which are equipped by structured comments for recent 2.6 kernels. I
have tried to proceed with that task. I have done that more times from 2.6.0
time and it gets boring to do same changes again and again. Linux kernel
compiles after changes for i386 and ARM targets. I have added references to
some more files into kernel-api book, I have added some section names as well.
So please, check that changes do not break something and that categories are
not too much skewed.
I have changed kernel-doc to accept "fastcall" and "asmlinkage" words reserved
by kernel convention. Most of the other changes are modifications in the
comments to make kernel-doc happy, accept some parameters description and do
not bail out on errors. Changed <pid> to @pid in the description, moved some
#ifdef before comments to correct function to comments bindings, etc.
You can see result of the modified documentation build at
http://cmp.felk.cvut.cz/~pisa/linux/lkdb-2.6.11.tar.gz
Some more sources are ready to be included into kernel-doc generated
documentation. Sources has been added into kernel-api for now. Some more
section names added and probably some more chaos introduced as result of quick
cleanup work.
Signed-off-by: Pavel Pisa <pisa@cmp.felk.cvut.cz>
Signed-off-by: Martin Waitz <tali@admingilde.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is a patch for counting the number of pages for bounce buffers. It's
shown in /proc/vmstat.
Currently, the number of bounce pages are not counted anywhere. So, if
there are many bounce pages, it seems that there are leaked pages. And
it's difficult for a user to imagine the usage of bounce pages. So, it's
meaningful to show # of bouce pages.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Mempools have 2 problems.
The first is that mempool_alloc can possibly get stuck in __alloc_pages
when they should opt to fail, and take an element from their reserved pool.
The second is that it will happily eat emergency PF_MEMALLOC reserves
instead of going to their reserved pools.
Fix the first by passing __GFP_NORETRY in the allocation calls in
mempool_alloc. Fix the second by introducing a __GFP_MEMPOOL flag which
directs the page allocator not to allocate from the reserve pool.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Jack Steiner reported this to have fixed his problem (bad colouring):
"The patches fix both problems that I found - bad
coloring & excessive pages in pagesets."
In most workloads this is not likely to be such a pronounced problem,
however it should help corner cases. And avoiding powers of 2 in these
types of memory operations is always a good idea.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!