Commit Graph

20 Commits

Author SHA1 Message Date
Vitaly Wool ca0246bb97 z3fold: fix possible reclaim races
Reclaim and free can race on an object which is basically fine but in
order for reclaim to be able to map "freed" object we need to encode
object length in the handle.  handle_to_chunks() is then introduced to
extract object length from a handle and use it during mapping.

Moreover, to avoid racing on a z3fold "headless" page release, we should
not try to free that page in z3fold_free() if the reclaim bit is set.
Also, in the unlikely case of trying to reclaim a page being freed, we
should not proceed with that page.

While at it, fix the page accounting in reclaim function.

This patch supersedes "[PATCH] z3fold: fix reclaim lock-ups".

Link: http://lkml.kernel.org/r/20181105162225.74e8837d03583a9b707cf559@gmail.com
Signed-off-by: Vitaly Wool <vitaly.vul@sony.com>
Signed-off-by: Jongseok Kim <ks77sj@gmail.com>
Reported-by-by: Jongseok Kim <ks77sj@gmail.com>
Reviewed-by: Snild Dolkow <snild@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-11-18 10:15:09 -08:00
Vitaly Wool 6098d7e136 z3fold: fix reclaim lock-ups
Do not try to optimize in-page object layout while the page is under
reclaim.  This fixes lock-ups on reclaim and improves reclaim
performance at the same time.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20180430125800.444cae9706489f412ad12621@gmail.com
Signed-off-by: Vitaly Wool <vitaly.vul@sony.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Cc: <Oleksiy.Avramchenko@sony.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-11 17:28:45 -07:00
Matthew Wilcox 8a97ea546b mm/z3fold.c: use gfpflags_allow_blocking
We have a perfectly good macro to determine whether the gfp flags allow
you to sleep or not; use it instead of trying to infer it.

Link: http://lkml.kernel.org/r/20180408062206.GC16007@bombadil.infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vitaly Wool <vitalywool@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:31 -07:00
Xidong Wang 1ec6995d12 z3fold: fix memory leak
In z3fold_create_pool(), the memory allocated by __alloc_percpu() is not
released on the error path that pool->compact_wq , which holds the
return value of create_singlethread_workqueue(), is NULL.  This will
result in a memory leak bug.

[akpm@linux-foundation.org: fix oops on kzalloc() failure, check __alloc_percpu() retval]
Link: http://lkml.kernel.org/r/1522803111-29209-1-git-send-email-wangxidong_97@163.com
Signed-off-by: Xidong Wang <wangxidong_97@163.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:31 -07:00
Vitaly Wool 5c9bab592f z3fold: limit use of stale list for allocation
Currently if z3fold couldn't find an unbuddied page it would first try
to pull a page off the stale list.  The problem with this approach is
that we can't 100% guarantee that the page is not processed by the
workqueue thread at the same time unless we run cancel_work_sync() on
it, which we can't do if we're in an atomic context.  So let's just
limit stale list usage to non-atomic contexts only.

Link: http://lkml.kernel.org/r/47ab51e7-e9c1-d30e-ab17-f734dbc3abce@gmail.com
Signed-off-by: Vitaly Vul <vitaly.vul@sony.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: <Oleksiy.Avramchenko@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:25 -07:00
Mike Rapoport f144c390f9 mm: docs: fix parameter names mismatch
There are several places where parameter descriptions do no match the
actual code.  Fix it.

Link: http://lkml.kernel.org/r/1516700871-22279-3-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-06 18:32:48 -08:00
Vitaly Wool 5d03a66139 mm/z3fold.c: use kref to prevent page free/compact race
There is a race in the current z3fold implementation between
do_compact() called in a work queue context and the page release
procedure when page's kref goes to 0.

do_compact() may be waiting for page lock, which is released by
release_z3fold_page_locked right before putting the page onto the
"stale" list, and then the page may be freed as do_compact() modifies
its contents.

The mechanism currently implemented to handle that (checking the
PAGE_STALE flag) is not reliable enough.  Instead, we'll use page's kref
counter to guarantee that the page is not released if its compaction is
scheduled.  It then becomes compaction function's responsibility to
decrease the counter and quit immediately if the page was actually
freed.

Link: http://lkml.kernel.org/r/20171117092032.00ea56f42affbed19f4fcc6c@gmail.com
Signed-off-by: Vitaly Wool <vitaly.wool@sonymobile.com>
Cc: <Oleksiy.Avramchenko@sony.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17 16:10:00 -08:00
Vitaly Wool 3552935742 z3fold: fix stale list handling
Fix the situation when clear_bit() is called for page->private before
the page pointer is actually assigned.  While at it, remove work_busy()
check because it is costly and does not give 100% guarantee anyway.

Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: <Oleksiy.Avramchenko@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03 17:54:24 -07:00
Vitaly Wool d5567c9df1 z3fold: fix potential race in z3fold_reclaim_page
It is possible that on a (partially) unsuccessful page reclaim,
kref_put() called in z3fold_reclaim_page() does not yield page release,
but the page is released shortly afterwards by another thread.  Then
z3fold_reclaim_page() would try to list_add() that (released) page again
which is obviously a bug.

To avoid that, spin_lock() has to be taken earlier, before the
kref_put() call mentioned earlier.

Link: http://lkml.kernel.org/r/20170913162937.bfff21c7d12b12a5f47639fd@gmail.com
Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: <Oleksiy.Avramchenko@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03 17:54:24 -07:00
Vitaly Wool d30561c56f z3fold: use per-cpu unbuddied lists
It's been noted that z3fold doesn't scale well when it's run in a large
number of threads on many cores, which can be easily reproduced with fio
'randrw' test with --numjobs=32.  E.g.  the result for 1 cluster (4 cores)
is:

Run status group 0 (all jobs):
   READ: io=244785MB, aggrb=496883KB/s, minb=15527KB/s, ...
  WRITE: io=246735MB, aggrb=500841KB/s, minb=15651KB/s, ...

While for 8 cores (2 clusters) the result is:

Run status group 0 (all jobs):
   READ: io=244785MB, aggrb=265942KB/s, minb=8310KB/s, ...
  WRITE: io=246735MB, aggrb=268060KB/s, minb=8376KB/s, ...

The bottleneck here is the pool lock which many threads become waiting
upon.  To reduce that spin lock contention, z3fold can operate only on
the lists local to the current CPU whenever possible.  Due to the nature
of z3fold unbuddied list handling (it only takes the first entry off the
list on a hot path), if the z3fold pool is big enough and balanced well
enough, limiting search to only local unbuddied list doesn't lead to a
significant compression ratio degrade (2.57x vs 2.65x in our
measurements).

This patch also introduces two worker threads: one for async in-page
object layout optimization and one for releasing freed pages.  This is
done to speed up z3fold_free() which is often on a hot path.

The fio results for 8-core case are now the following:

Run status group 0 (all jobs):
   READ: io=244785MB, aggrb=1568.3MB/s, minb=50182KB/s, ...
  WRITE: io=246735MB, aggrb=1580.8MB/s, minb=50582KB/s, ...

So we're in for almost 6x performance increase.

Link: http://lkml.kernel.org/r/20170806181443.f9b65018f8bde25ef990f9e8@gmail.com
Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:30 -07:00
Vitaly Wool 76e32a2a08 z3fold: fix page locking in z3fold_alloc()
Stress testing of the current z3fold implementation on a 8-core system
revealed it was possible that a z3fold page deleted from its unbuddied
list in z3fold_alloc() would be put on another unbuddied list by
z3fold_free() while z3fold_alloc() is still processing it.  This has
been introduced with commit 5a27aa822 ("z3fold: add kref refcounting")
due to the removal of special handling of a z3fold page not on any list
in z3fold_free().

To fix this, the z3fold page lock should be taken in z3fold_alloc()
before the pool lock is released.  To avoid deadlocking, we just try to
lock the page as soon as we get a hold of it, and if trylock fails, we
drop this page and take the next one.

Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: <Oleksiy.Avramchenko@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-13 18:24:20 -07:00
Vitaly Wool 271df90e4e z3fold: fix spinlock unlocking in page reclaim
Commmit 5a27aa8220 ("z3fold: add kref refcounting") introduced a bug
in z3fold_reclaim_page() with function exit that may leave pool->lock
spinlock held.  Here comes the trivial fix.

Fixes: 5a27aa8220 ("z3fold: add kref refcounting")
Link: http://lkml.kernel.org/r/20170311222239.7b83d8e7ef1914e05497649f@gmail.com
Reported-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-16 16:56:18 -07:00
Vitaly Wool 5a27aa8220 z3fold: add kref refcounting
With both coming and already present locking optimizations, introducing
kref to reference-count z3fold objects is the right thing to do.
Moreover, it makes buddied list no longer necessary, and allows for a
simpler handling of headless pages.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20170131214650.8ea78033d91ded233f552bc0@gmail.com
Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Reviewed-by: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:54 -08:00
Vitaly Wool 2f1e5e4d84 z3fold: use per-page spinlock
Most of z3fold operations are in-page, such as modifying z3fold page
header or moving z3fold objects within a page.  Taking per-pool spinlock
to protect per-page objects is therefore suboptimal, and the idea of
having a per-page spinlock (or rwlock) has been around for some time.

This patch implements spinlock-based per-page locking mechanism which is
lightweight enough to normally fit ok into the z3fold header.

Link: http://lkml.kernel.org/r/20170131214438.433e0a5fda908337b63206d3@gmail.com
Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Reviewed-by: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:54 -08:00
Vitaly Wool 1b096e5ae9 z3fold: extend compaction function
z3fold_compact_page() currently only handles the situation when there's
a single middle chunk within the z3fold page.  However it may be worth
it to move middle chunk closer to either first or last chunk, whichever
is there, if the gap between them is big enough.

This patch adds the relevant code, using BIG_CHUNK_GAP define as a
threshold for middle chunk to be worth moving.

Link: http://lkml.kernel.org/r/20170131214334.c4f3eac9a477af0fa9a22c46@gmail.com
Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Reviewed-by: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:54 -08:00
Vitaly Wool ede93213aa z3fold: fix header size related issues
Currently the whole kernel build will be stopped if the size of struct
z3fold_header is greater than the size of one chunk, which is 64 bytes
by default.  This patch instead defines the offset for z3fold objects as
the size of the z3fold header in chunks.

Fixed also are the calculation of num_free_chunks() and the address to
move the middle chunk to in case of in-page compaction in
z3fold_compact_page().

Link: http://lkml.kernel.org/r/20170131214057.d98677032bc7b1c6c59a80c9@gmail.com
Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Reviewed-by: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:54 -08:00
Vitaly Wool 12d59ae678 z3fold: make pages_nr atomic
Convert pages_nr per-pool counter to atomic64_t.

Link: http://lkml.kernel.org/r/20170131213946.b828676ab17bbea42022c213@gmail.com
Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Reviewed-by: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:54 -08:00
zhong jiang f201ebd876 mm/z3fold.c: limit first_num to the actual range of possible buddy indexes
At present, Tying the first_num size to NCHUNKS_ORDER is confusing.  the
number of chunks is completely unrelated to the number of buddies.

The patch limits the first_num to actual range of possible buddy indexes.
and that is more reasonable and obvious without functional change.

Link: http://lkml.kernel.org/r/1476776569-29504-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Suggested-by: Dan Streetman <ddstreet@ieee.org>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Acked-by: Vitaly Wool <vitalywool@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:31 -08:00
Vitaly Wool 43afc19417 mm/z3fold.c: avoid modifying HEADLESS page and minor cleanup
Fix erroneous z3fold header access in a HEADLESS page in reclaim
function, and change one remaining direct handle-to-buddy conversion to
use the appropriate helper.

Link: http://lkml.kernel.org/r/5748706F.9020208@gmail.com
Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Reviewed-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-03 16:02:55 -07:00
Vitaly Wool 9a001fc19c z3fold: the 3-fold allocator for compressed pages
This patch introduces z3fold, a special purpose allocator for storing
compressed pages.  It is designed to store up to three compressed pages
per physical page.  It is a ZBUD derivative which allows for higher
compression ratio keeping the simplicity and determinism of its
predecessor.

This patch comes as a follow-up to the discussions at the Embedded Linux
Conference in San-Diego related to the talk [1].  The outcome of these
discussions was that it would be good to have a compressed page
allocator as stable and deterministic as zbud with with higher
compression ratio.

To keep the determinism and simplicity, z3fold, just like zbud, always
stores an integral number of compressed pages per page, but it can store
up to 3 pages unlike zbud which can store at most 2.  Therefore the
compression ratio goes to around 2.6x while zbud's one is around 1.7x.

The patch is based on the latest linux.git tree.

This version has been updated after testing on various simulators (e.g.
ARM Versatile Express, MIPS Malta, x86_64/Haswell) and basing on
comments from Dan Streetman [3].

[1] https://openiotelc2016.sched.org/event/6DAC/swapping-and-embedded-compression-relieves-the-pressure-vitaly-wool-softprise-consulting-ou
[2] https://lkml.org/lkml/2016/4/21/799
[3] https://lkml.org/lkml/2016/5/4/852

Link: http://lkml.kernel.org/r/20160509151753.ec3f9fda3c9898d31ff52a32@gmail.com
Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00