With the new ->readahead operation, locked pages are added to the page
cache, preventing two threads from racing with each other to read the
same chunk of file, so this is dead code.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Just clean code, no logical change.
Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
F2FS inode may have different max size, e.g. compressed file have
less blkaddr entries in all its direct-node blocks, result in being
with less max filesize. So change to use per-inode maxbytes.
Suggested-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
generic/269 reports a hangtask issue, the root cause is ABBA deadlock
described as below:
Thread A Thread B
- down_write(&sbi->gc_lock) -- A
- f2fs_write_data_pages
- lock all pages in cluster -- B
- f2fs_write_multi_pages
- f2fs_write_raw_pages
- f2fs_write_single_data_page
- f2fs_balance_fs
- down_write(&sbi->gc_lock) -- A
- f2fs_gc
- do_garbage_collect
- ra_data_block
- pagecache_get_page -- B
To fix this, it needs to avoid calling f2fs_balance_fs() if there is
still cluster pages been locked in context of cluster writeback, so
instead, let's call f2fs_balance_fs() in the end of
f2fs_write_raw_pages() when all cluster pages were unlocked.
Fixes: 4c8ff7095b ("f2fs: support data compression")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Rework the post-read processing logic to be much easier to understand.
At least one bug is fixed by this: if an I/O error occurred when reading
from disk, decryption and verity would be performed on the uninitialized
data, causing misleading messages in the kernel log.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add a new directory 'stat' in path of /sys/fs/f2fs/<devname>/, later
we can add new readonly stat sysfs file into this directory, it will
make <devname> directory less mess.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Expand 'compress_algorithm' mount option to accept parameter as format of
<algorithm>:<level>, by this way, it gives a way to allow user to do more
specified config on lz4 and zstd compression level, then f2fs compression
can provide higher compress ratio.
In order to set compress level for lz4 algorithm, it needs to set
CONFIG_LZ4HC_COMPRESS and CONFIG_F2FS_FS_LZ4HC config to enable lz4hc
compress algorithm.
CR and performance number on lz4/lz4hc algorithm:
dd if=enwik9 of=compressed_file conv=fsync
Original blocks: 244382
lz4 lz4hc-9
compressed blocks 170647 163270
compress ratio 69.8% 66.8%
speed 16.4207 s, 60.9 MB/s 26.7299 s, 37.4 MB/s
compress ratio = after / before
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In this round, we've made more work into per-file compression support. For
example, F2FS_IOC_GET|SET_COMPRESS_OPTION provides a way to change the
algorithm or cluster size per file. F2FS_IOC_COMPRESS|DECOMPRESS_FILE provides
a way to compress and decompress the existing normal files manually along with
a new mount option, compress_mode=fs|user, which can control who compresses the
data. Chao also added a checksum feature with a mount option so that we are able
to detect any corrupted cluster. In addition, Daniel contributed casefolding
with encryption patch, which will be used for Android devices.
Enhancement:
- add ioctls and mount option to manage per-file compression feature
- support casefolding with encryption
- support checksum for compressed cluster
- avoid IO starvation by replacing mutex with rwsem
- add sysfs, max_io_bytes, to control max bio size
Bug fix:
- fix use-after-free issue when compression and fsverity are enabled
- fix consistency corruption during fault injection test
- fix data offset for lseek
- get rid of buffer_head which has 32bits limit in fiemap
- fix some bugs in multi-partitions support
- fix nat entry count calculation in shrinker
- fix some stat information
And, we've refactored some logics and fix minor bugs as well.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAl/a8ywACgkQQBSofoJI
UNLa2RAAjK+6tOs+NuYx2w9SegghKxwCg4Mb362BMdaAGx6GzMqAkCiVdujuoz/r
+wy8sdqO9QE7723ZDNsebNMLRnkNPHnpneSL2p6OsSLJrD3ORTELVRrzNlkemvnK
rRHZyYnNJvQQnD4uU7ABvROKsIDw/nCfcFvzHmLIgEw8EHO0W4n6fTtBdTwXv1qi
N3qXhGuQldonR9XICuGjzj7wh17n9ua6Mr12XX3Ok38giMcZb9KFBwgvlhl35cxt
htEmUpxWD3NTSw6zJmV4VAiajpiIkW6QRQuVA1nzdLZK644gaJMhM1EUsOnZhfDl
wX0ZtKoNkXxb0glD34O3aYqeHJ3tHWgPmmpVm9TECJP9A/X7kmEHgQYpH/eJ9I7d
tk51Uz28Mz1RShXU4i5RyKZeeoNTLiVlqiC95E2cnq4C1tLOJyI00N9AinrLzvR+
fqUrAwCrBpiYX63mWKYwq7GWxWwp4+PY09kyIZxxJiWhTE/St0bRx2bQL8zA8C6J
Rtxl+QWyQhkFbNu8fAukLFAhC6mqX/FKpXvUqRehBnHRvMWBiVZG0//eOPQLk71u
qsdCgYuEVcg3itDQrZvmsjxi4Pb5E9mNr0s5oC4I2WvBPMheD4esSyG7cKDN0qfS
3FFHlRYLOvnjPMLnKTmZXjFvFyHR8mwsD4Z83MeSrqYnWC14tFY=
=KneU
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've made more work into per-file compression support.
For example, F2FS_IOC_GET | SET_COMPRESS_OPTION provides a way to
change the algorithm or cluster size per file. F2FS_IOC_COMPRESS |
DECOMPRESS_FILE provides a way to compress and decompress the existing
normal files manually.
There is also a new mount option, compress_mode=fs|user, which can
control who compresses the data.
Chao also added a checksum feature with a mount option so that
we are able to detect any corrupted cluster.
In addition, Daniel contributed casefolding with encryption patch,
which will be used for Android devices.
Summary:
Enhancements:
- add ioctls and mount option to manage per-file compression feature
- support casefolding with encryption
- support checksum for compressed cluster
- avoid IO starvation by replacing mutex with rwsem
- add sysfs, max_io_bytes, to control max bio size
Bug fixes:
- fix use-after-free issue when compression and fsverity are enabled
- fix consistency corruption during fault injection test
- fix data offset for lseek
- get rid of buffer_head which has 32bits limit in fiemap
- fix some bugs in multi-partitions support
- fix nat entry count calculation in shrinker
- fix some stat information
And, we've refactored some logics and fix minor bugs as well"
* tag 'f2fs-for-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (36 commits)
f2fs: compress: fix compression chksum
f2fs: fix shift-out-of-bounds in sanity_check_raw_super()
f2fs: fix race of pending_pages in decompression
f2fs: fix to account inline xattr correctly during recovery
f2fs: inline: fix wrong inline inode stat
f2fs: inline: correct comment in f2fs_recover_inline_data
f2fs: don't check PAGE_SIZE again in sanity_check_raw_super()
f2fs: convert to F2FS_*_INO macro
f2fs: introduce max_io_bytes, a sysfs entry, to limit bio size
f2fs: don't allow any writes on readonly mount
f2fs: avoid race condition for shrinker count
f2fs: add F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE
f2fs: add compress_mode mount option
f2fs: Remove unnecessary unlikely()
f2fs: init dirty_secmap incorrectly
f2fs: remove buffer_head which has 32bits limit
f2fs: fix wrong block count instead of bytes
f2fs: use new conversion functions between blks and bytes
f2fs: rename logical_to_blk and blk_to_logical
f2fs: fix kbytes written stat for multi-device case
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/Xec8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpoLbEACzXypgZWwMdfgRckA/Vt333rXHtbhUV+hK
2XP+P81iRvr9Esi31UPbRp82vrgcDO0cpI1QmQojS5U5TIQP88BfXptfRZZu48eb
wT5RDDNQ34HItqAh/yEuYsv9yUKcxeIrB99tBVvM+4UmQg9zTdIW3mg6PvCBdbhV
N38jI0tCF/PJatjfRuphT/nXonQLPWBlVDmZk06KZQFOwQe9ep1vUi1+nbiRPuo3
geFBpTh1Kp6Vl1B3n4RpECs6Y7I0RRuJdaH2sDizICla1/BW91F9fQwHimNnUxUq
e1Q1kMuh6ftcQGkYlHSYcPhuv6CvorldTZCO5arPxWpcwvxriTSMRPWAgUr5pEiF
fhiGhqeDu9e6vl9vS31wUD1B30hy+jFz9wyjRrDwJ3cPHH1JVBjTzvdX+cIh/1ku
IbIwUMteUtvUrzqAv/DzbGhedp7xWtOFaVo8j0QFYh9zkjd6b8yDOF/yztwX2gjY
Xt1cd+KpDSiN449ZRaoMI0sCJAxqzhMa6nsWlb0L7KuNyWKAbvKQBm9Rb47FLV9A
Vx70KC+zkFoyw23capvIahmQazerriUJ5PGe0lVm6ROgmIFdCpXTPDjnrvq/6RZ/
GEpD7gTW9atGJ7EuEE8686sAfKD5kneChWLX5EHXf0d0AG5Mr2lKsluiGp5LpPJg
Q1Xqs6xwww==
=zo4w
-----END PGP SIGNATURE-----
Merge tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
"Another series of killing more code than what is being added, again
thanks to Christoph's relentless cleanups and tech debt tackling.
This contains:
- blk-iocost improvements (Baolin Wang)
- part0 iostat fix (Jeffle Xu)
- Disable iopoll for split bios (Jeffle Xu)
- block tracepoint cleanups (Christoph Hellwig)
- Merging of struct block_device and hd_struct (Christoph Hellwig)
- Rework/cleanup of how block device sizes are updated (Christoph
Hellwig)
- Simplification of gendisk lookup and removal of block device
aliasing (Christoph Hellwig)
- Block device ioctl cleanups (Christoph Hellwig)
- Removal of bdget()/blkdev_get() as exported API (Christoph Hellwig)
- Disk change rework, avoid ->revalidate_disk() (Christoph Hellwig)
- sbitmap improvements (Pavel Begunkov)
- Hybrid polling fix (Pavel Begunkov)
- bvec iteration improvements (Pavel Begunkov)
- Zone revalidation fixes (Damien Le Moal)
- blk-throttle limit fix (Yu Kuai)
- Various little fixes"
* tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block: (126 commits)
blk-mq: fix msec comment from micro to milli seconds
blk-mq: update arg in comment of blk_mq_map_queue
blk-mq: add helper allocating tagset->tags
Revert "block: Fix a lockdep complaint triggered by request queue flushing"
nvme-loop: use blk_mq_hctx_set_fq_lock_class to set loop's lock class
blk-mq: add new API of blk_mq_hctx_set_fq_lock_class
block: disable iopoll for split bio
block: Improve blk_revalidate_disk_zones() checks
sbitmap: simplify wrap check
sbitmap: replace CAS with atomic and
sbitmap: remove swap_lock
sbitmap: optimise sbitmap_deferred_clear()
blk-mq: skip hybrid polling if iopoll doesn't spin
blk-iocost: Factor out the base vrate change into a separate function
blk-iocost: Factor out the active iocgs' state check into a separate function
blk-iocost: Move the usage ratio calculation to the correct place
blk-iocost: Remove unnecessary advance declaration
blk-iocost: Fix some typos in comments
blktrace: fix up a kerneldoc comment
block: remove the request_queue to argument request based tracepoints
...
This patch adds max_io_bytes to limit bio size when f2fs tries to merge
consecutive IOs. This can give a testing point to split out bios and check
end_io handles those bios correctly. This is used to capture a recent bug
on the decompression and fsverity flow.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We will add a new "compress_mode" mount option to control file
compression mode. This supports "fs" and "user". In "fs" mode (default),
f2fs does automatic compression on the compression enabled files.
In "user" mode, f2fs disables the automaic compression and gives the
user discretion of choosing the target file and the timing. It means
the user can do manual compression/decompression on the compression
enabled files using ioctls.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
WARN_ON() already contains an unlikely(), so it's not necessary
to use unlikely.
Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Shuosheng Huang <huangshuosheng@allwinnertech.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
For multi-device case, one f2fs image includes multi devices, so it
needs to account bytes written of all block devices belong to the image
rather than one main block device, fix it.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This patch supports to store chksum value with compressed
data, and verify the integrality of compressed data while
reading the data.
The feature can be enabled through specifying mount option
'compress_chksum'.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Lei Li reported a issue: if foreground operations are frequent, background
checkpoint may be always skipped due to below check, result in losing more
data after sudden power-cut.
f2fs_balance_fs_bg()
...
if (!is_idle(sbi, REQ_TIME) &&
(!excess_dirty_nats(sbi) && !excess_dirty_nodes(sbi)))
return;
E.g:
cp_interval = 5 second
idle_interval = 2 second
foreground operation interval = 1 second (append 1 byte per second into file)
In such case, no matter when it calls f2fs_balance_fs_bg(), is_idle(, REQ_TIME)
returns false, result in skipping background checkpoint.
This patch changes as below to make trigger condition being more reasonable:
- trigger sync_fs() if dirty_{nats,nodes} and prefree segs exceeds threshold;
- skip triggering sync_fs() if there is any background inflight IO or there is
foreground operation recently and meanwhile cp_rwsem is being held by someone;
Reported-by: Lei Li <noctis.akm@gmail.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use rwsem to ensure serialization of the callers and to avoid
starvation of high priority tasks, when the system is under
heavy IO workload.
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Expand f2fs's casefolding support to include encrypted directories. To
index casefolded+encrypted directories, we use the SipHash of the
casefolded name, keyed by a key derived from the directory's fscrypt
master key. This ensures that the dirhash doesn't leak information
about the plaintext filenames.
Encryption keys are unavailable during roll-forward recovery, so we
can't compute the dirhash when recovering a new dentry in an encrypted +
casefolded directory. To avoid having to force a checkpoint when a new
file is fsync'ed, store the dirhash on-disk appended to i_name.
This patch incorporates work by Eric Biggers <ebiggers@google.com>
and Jaegeuk Kim <jaegeuk@kernel.org>.
Co-developed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This shifts the responsibility of setting up dentry operations from
fscrypt to the individual filesystems, allowing them to have their own
operations while still setting fscrypt's d_revalidate as appropriate.
Most filesystems can just use generic_set_encrypted_ci_d_ops, unless
they have their own specific dentry operations as well. That operation
will set the minimal d_ops required under the circumstances.
Since the fscrypt d_ops are set later on, we must set all d_ops there,
since we cannot adjust those later on. This should not result in any
change in behavior.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Use struct block_device to lookup partitions on a disk. This removes
all usage of struct hd_struct from the I/O path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: Chao Yu <yuchao0@huawei.com> [f2fs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As described in "fscrypt: add fscrypt_is_nokey_name()", it's possible to
create a duplicate filename in an encrypted directory by creating a file
concurrently with adding the directory's encryption key.
Fix this bug on f2fs by rejecting no-key dentries in f2fs_add_link().
Note that the weird check for the current task in f2fs_do_add_link()
seems to make this bug difficult to reproduce on f2fs.
Fixes: 9ea97163c6 ("f2fs crypto: add filename encryption for f2fs_add_link")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201118075609.120337-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Like other filesystem does, we introduce a new file f2fs.h in path of
include/uapi/linux/, and move f2fs-specified ioctl interface definitions
to that file, after then, in order to use those definitions, userspace
developer only need to include the new header file rather than
copy & paste definitions from fs/f2fs/f2fs.h.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In this round, we've added new features such as zone capacity for ZNS and
a new GC policy, ATGC, along with in-memory segment management. In addition,
we could improve the decompression speed significantly by changing virtual
mapping method. Even though we've fixed lots of small bugs in compression
support, I feel that it becomes more stable so that I could give it a try in
production.
Enhancement:
- suport zone capacity in NVMe Zoned Namespace devices
- introduce in-memory current segment management
- add standart casefolding support
- support age threshold based garbage collection
- improve decompression speed by changing virtual mapping method
Bug fix:
- fix condition checks in some ioctl() such as compression, move_range, etc
- fix 32/64bits support in data structures
- fix memory allocation in zstd decompress
- add some boundary checks to avoid kernel panic on corrupted image
- fix disallowing compression for non-empty file
- fix slab leakage of compressed block writes
In addition, it includes code refactoring for better readability and minor
bug fixes for compression and zoned device support.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAl+I0nQACgkQQBSofoJI
UNIJdw//Rj0YapYXSu1nlOQppzSAkCiL6wxrrG2qkisRE7uXSYGZfWBqrMT6Asnf
i0f25i6ywZOZ00dgN/klZRBh4YYSgJqYx9BPkTxZsQZ/S/EmZJPpr8m4VUB69LKL
VijwUdgcW9vNDJ2/DkDQDVBd/ZqRxXnltffWtP4pS96Gj089/dE2q8KXqQrt3LM1
lLQjDfHj+0AyWRzKpErTO0W9DOgO7wmmelS0h6m2RYttkbb328JEZezg5bjWNNlk
eTXxuAFFy8Ap9DngkC/sqvY2NRTv1YgOPfrT8XWwdDIiFTZ+LoYdFI5Ap/UW7QwG
iz7B/0wPj3+9ncl536LRbFPiLisbYrArYGmZKF6t8w1cP6mTVlileMT4Q6s9+qhn
GJS88tTBVlR9vbzDu2brjI6qRQVTBdsohIGoA1g6lz0ogbphhmTzujPbFQ6GTSBi
3sKKp59urkBpVH3TVJU1oshLjIEG2yToMgYwZH9DU7zlzJS6XpetJrzReqItEThc
VNixg2DxdIFQ+nrMt+LtWaOHs5qzxIIPksguGEhqSkLL5lI75n2MZxrhKXmUsaZa
qItJE0ndJFfi6vggIkJID+a0bpTss7+AxF1AmSZDafMLkZy8j14DNQAnmBUYRX0J
5QExZ+LyyaKjWQ/k9SsSzV1Y3dbguDyB+gkeMhr/6XEr9DSwZc8=
=exgv
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, we've added new features such as zone capacity for ZNS
and a new GC policy, ATGC, along with in-memory segment management. In
addition, we could improve the decompression speed significantly by
changing virtual mapping method. Even though we've fixed lots of small
bugs in compression support, I feel that it becomes more stable so
that I could give it a try in production.
Enhancements:
- suport zone capacity in NVMe Zoned Namespace devices
- introduce in-memory current segment management
- add standart casefolding support
- support age threshold based garbage collection
- improve decompression speed by changing virtual mapping method
Bug fixes:
- fix condition checks in some ioctl() such as compression, move_range, etc
- fix 32/64bits support in data structures
- fix memory allocation in zstd decompress
- add some boundary checks to avoid kernel panic on corrupted image
- fix disallowing compression for non-empty file
- fix slab leakage of compressed block writes
In addition, it includes code refactoring for better readability and
minor bug fixes for compression and zoned device support"
* tag 'f2fs-for-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (51 commits)
f2fs: code cleanup by removing unnecessary check
f2fs: wait for sysfs kobject removal before freeing f2fs_sb_info
f2fs: fix writecount false positive in releasing compress blocks
f2fs: introduce check_swap_activate_fast()
f2fs: don't issue flush in f2fs_flush_device_cache() for nobarrier case
f2fs: handle errors of f2fs_get_meta_page_nofail
f2fs: fix to set SBI_NEED_FSCK flag for inconsistent inode
f2fs: reject CASEFOLD inode flag without casefold feature
f2fs: fix memory alignment to support 32bit
f2fs: fix slab leak of rpages pointer
f2fs: compress: fix to disallow enabling compress on non-empty file
f2fs: compress: introduce cic/dic slab cache
f2fs: compress: introduce page array slab cache
f2fs: fix to do sanity check on segment/section count
f2fs: fix to check segment boundary during SIT page readahead
f2fs: fix uninit-value in f2fs_lookup
f2fs: remove unneeded parameter in find_in_block()
f2fs: fix wrong total_sections check and fsmeta check
f2fs: remove duplicated code in sanity_check_area_boundary
f2fs: remove unused check on version_bitmap
...
First problem is we hit BUG_ON() in f2fs_get_sum_page given EIO on
f2fs_get_meta_page_nofail().
Quick fix was not to give any error with infinite loop, but syzbot caught
a case where it goes to that loop from fuzzed image. In turned out we abused
f2fs_get_meta_page_nofail() like in the below call stack.
- f2fs_fill_super
- f2fs_build_segment_manager
- build_sit_entries
- get_current_sit_page
INFO: task syz-executor178:6870 can't die for more than 143 seconds.
task:syz-executor178 state:R
stack:26960 pid: 6870 ppid: 6869 flags:0x00004006
Call Trace:
Showing all locks held in the system:
1 lock held by khungtaskd/1179:
#0: ffffffff8a554da0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x53/0x260 kernel/locking/lockdep.c:6242
1 lock held by systemd-journal/3920:
1 lock held by in:imklog/6769:
#0: ffff88809eebc130 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0xe9/0x100 fs/file.c:930
1 lock held by syz-executor178/6870:
#0: ffff8880925120e0 (&type->s_umount_key#47/1){+.+.}-{3:3}, at: alloc_super+0x201/0xaf0 fs/super.c:229
Actually, we didn't have to use _nofail in this case, since we could return
error to mount(2) already with the error handler.
As a result, this patch tries to 1) remove _nofail callers as much as possible,
2) deal with error case in last remaining caller, f2fs_get_sum_page().
Reported-by: syzbot+ee250ac8137be41d7b13@syzkaller.appspotmail.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In 32bit system, 64-bits key breaks memory alignment.
This fixes the commit "f2fs: support 64-bits key in f2fs rb-tree node entry".
Reported-by: Nicolas Chauvet <kwizart@gmail.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Add two slab caches: "f2fs_cic_entry" and "f2fs_dic_entry" for memory
allocation of compress_io_ctx and decompress_io_ctx structure.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The behavior of the test_dummy_encryption mount option is that when a
new file (or directory or symlink) is created in an unencrypted
directory, it's automatically encrypted using a dummy encryption policy.
That's it; in particular, the encryption (or lack thereof) of existing
files (or directories or symlinks) doesn't change.
Unfortunately the implementation of test_dummy_encryption is a bit weird
and confusing. When test_dummy_encryption is enabled and a file is
being created in an unencrypted directory, we set up an encryption key
(->i_crypt_info) for the directory. This isn't actually used to do any
encryption, however, since the directory is still unencrypted! Instead,
->i_crypt_info is only used for inheriting the encryption policy.
One consequence of this is that the filesystem ends up providing a
"dummy context" (policy + nonce) instead of a "dummy policy". In
commit ed318a6cc0 ("fscrypt: support test_dummy_encryption=v2"), I
mistakenly thought this was required. However, actually the nonce only
ends up being used to derive a key that is never used.
Another consequence of this implementation is that it allows for
'inode->i_crypt_info != NULL && !IS_ENCRYPTED(inode)', which is an edge
case that can be forgotten about. For example, currently
FS_IOC_GET_ENCRYPTION_POLICY on an unencrypted directory may return the
dummy encryption policy when the filesystem is mounted with
test_dummy_encryption. That seems like the wrong thing to do, since
again, the directory itself is not actually encrypted.
Therefore, switch to a more logical and maintainable implementation
where the dummy encryption policy inheritance is done without setting up
keys for unencrypted directories. This involves:
- Adding a function fscrypt_policy_to_inherit() which returns the
encryption policy to inherit from a directory. This can be a real
policy, a dummy policy, or no policy.
- Replacing struct fscrypt_dummy_context, ->get_dummy_context(), etc.
with struct fscrypt_dummy_policy, ->get_dummy_policy(), etc.
- Making fscrypt_fname_encrypted_size() take an fscrypt_policy instead
of an inode.
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Acked-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-13-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Convert f2fs to use the new functions fscrypt_prepare_new_inode() and
fscrypt_set_context(). This avoids calling
fscrypt_get_encryption_info() from under f2fs_lock_op(), which can
deadlock because fscrypt_get_encryption_info() isn't GFP_NOFS-safe.
For more details about this problem, see the earlier patch
"fscrypt: add fscrypt_prepare_new_inode() and fscrypt_set_context()".
This also fixes a f2fs-specific deadlock when the filesystem is mounted
with '-o test_dummy_encryption' and a file is created in an unencrypted
directory other than the root directory:
INFO: task touch:207 blocked for more than 30 seconds.
Not tainted 5.9.0-rc4-00099-g729e3d0919844 #2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:touch state:D stack: 0 pid: 207 ppid: 167 flags:0x00000000
Call Trace:
[...]
lock_page include/linux/pagemap.h:548 [inline]
pagecache_get_page+0x25e/0x310 mm/filemap.c:1682
find_or_create_page include/linux/pagemap.h:348 [inline]
grab_cache_page include/linux/pagemap.h:424 [inline]
f2fs_grab_cache_page fs/f2fs/f2fs.h:2395 [inline]
f2fs_grab_cache_page fs/f2fs/f2fs.h:2373 [inline]
__get_node_page.part.0+0x39/0x2d0 fs/f2fs/node.c:1350
__get_node_page fs/f2fs/node.c:35 [inline]
f2fs_get_node_page+0x2e/0x60 fs/f2fs/node.c:1399
read_inline_xattr+0x88/0x140 fs/f2fs/xattr.c:288
lookup_all_xattrs+0x1f9/0x2c0 fs/f2fs/xattr.c:344
f2fs_getxattr+0x9b/0x160 fs/f2fs/xattr.c:532
f2fs_get_context+0x1e/0x20 fs/f2fs/super.c:2460
fscrypt_get_encryption_info+0x9b/0x450 fs/crypto/keysetup.c:472
fscrypt_inherit_context+0x2f/0xb0 fs/crypto/policy.c:640
f2fs_init_inode_metadata+0xab/0x340 fs/f2fs/dir.c:540
f2fs_add_inline_entry+0x145/0x390 fs/f2fs/inline.c:621
f2fs_add_dentry+0x31/0x80 fs/f2fs/dir.c:757
f2fs_do_add_link+0xcd/0x130 fs/f2fs/dir.c:798
f2fs_add_link fs/f2fs/f2fs.h:3234 [inline]
f2fs_create+0x104/0x290 fs/f2fs/namei.c:344
lookup_open.isra.0+0x2de/0x500 fs/namei.c:3103
open_last_lookups+0xa9/0x340 fs/namei.c:3177
path_openat+0x8f/0x1b0 fs/namei.c:3365
do_filp_open+0x87/0x130 fs/namei.c:3395
do_sys_openat2+0x96/0x150 fs/open.c:1168
[...]
That happened because f2fs_add_inline_entry() locks the directory
inode's page in order to add the dentry, then f2fs_get_context() tries
to lock it recursively in order to read the encryption xattr. This
problem is specific to "test_dummy_encryption" because normally the
directory's fscrypt_info would be set up prior to
f2fs_add_inline_entry() in order to encrypt the new filename.
Regardless, the new design fixes this test_dummy_encryption deadlock as
well as potential deadlocks with fs reclaim, by setting up any needed
fscrypt_info structs prior to taking so many locks.
The test_dummy_encryption deadlock was reported by Daniel Rosenberg.
Reported-by: Daniel Rosenberg <drosen@google.com>
Acked-by: Jaegeuk Kim <jaegeuk@kernel.org>
Link: https://lore.kernel.org/r/20200917041136.178600-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
The returned integer is not required anywhere. So we need to change
the return value to bool type.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
writepages() can be concurrently invoked for the same file by different
threads such as a thread fsyncing the file and a kworker kernel thread.
So, changing i_compr_blocks without protection is racy and we need to
protect it by changing it with atomic type value. Plus, we don't need
a 64bit value for i_compr_blocks, so just we will use a atomic value,
not atomic64.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
As 5kft <5kft@5kft.org> reported:
kworker/u9:3: page allocation failure: order:9, mode:0x40c40(GFP_NOFS|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
CPU: 3 PID: 8168 Comm: kworker/u9:3 Tainted: G C 5.8.3-sunxi #trunk
Hardware name: Allwinner sun8i Family
Workqueue: f2fs_post_read_wq f2fs_post_read_work
[<c010d6d5>] (unwind_backtrace) from [<c0109a55>] (show_stack+0x11/0x14)
[<c0109a55>] (show_stack) from [<c056d489>] (dump_stack+0x75/0x84)
[<c056d489>] (dump_stack) from [<c0243b53>] (warn_alloc+0xa3/0x104)
[<c0243b53>] (warn_alloc) from [<c024473b>] (__alloc_pages_nodemask+0xb87/0xc40)
[<c024473b>] (__alloc_pages_nodemask) from [<c02267c5>] (kmalloc_order+0x19/0x38)
[<c02267c5>] (kmalloc_order) from [<c02267fd>] (kmalloc_order_trace+0x19/0x90)
[<c02267fd>] (kmalloc_order_trace) from [<c047c665>] (zstd_init_decompress_ctx+0x21/0x88)
[<c047c665>] (zstd_init_decompress_ctx) from [<c047e9cf>] (f2fs_decompress_pages+0x97/0x228)
[<c047e9cf>] (f2fs_decompress_pages) from [<c045d0ab>] (__read_end_io+0xfb/0x130)
[<c045d0ab>] (__read_end_io) from [<c045d141>] (f2fs_post_read_work+0x61/0x84)
[<c045d141>] (f2fs_post_read_work) from [<c0130b2f>] (process_one_work+0x15f/0x3b0)
[<c0130b2f>] (process_one_work) from [<c0130e7b>] (worker_thread+0xfb/0x3e0)
[<c0130e7b>] (worker_thread) from [<c0135c3b>] (kthread+0xeb/0x10c)
[<c0135c3b>] (kthread) from [<c0100159>]
zstd may allocate large size memory for {,de}compression, it may cause
file copy failure on low-end device which has very few memory.
For decompression, let's just allocate proper size memory based on current
file's cluster size instead of max cluster size.
Reported-by: 5kft <5kft@5kft.org>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Current compr_blocks of superblock info is not 64bit value. We are
accumulating each i_compr_blocks count of inodes to this value and
those are 64bit values. So, need to change this to 64bit value.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There are several issues in current background GC algorithm:
- valid blocks is one of key factors during cost overhead calculation,
so if segment has less valid block, however even its age is young or
it locates hot segment, CB algorithm will still choose the segment as
victim, it's not appropriate.
- GCed data/node will go to existing logs, no matter in-there datas'
update frequency is the same or not, it may mix hot and cold data
again.
- GC alloctor mainly use LFS type segment, it will cost free segment
more quickly.
This patch introduces a new algorithm named age threshold based
garbage collection to solve above issues, there are three steps
mainly:
1. select a source victim:
- set an age threshold, and select candidates beased threshold:
e.g.
0 means youngest, 100 means oldest, if we set age threshold to 80
then select dirty segments which has age in range of [80, 100] as
candiddates;
- set candidate_ratio threshold, and select candidates based the
ratio, so that we can shrink candidates to those oldest segments;
- select target segment with fewest valid blocks in order to
migrate blocks with minimum cost;
2. select a target victim:
- select candidates beased age threshold;
- set candidate_radius threshold, search candidates whose age is
around source victims, searching radius should less than the
radius threshold.
- select target segment with most valid blocks in order to avoid
migrating current target segment.
3. merge valid blocks from source victim into target victim with
SSR alloctor.
Test steps:
- create 160 dirty segments:
* half of them have 128 valid blocks per segment
* left of them have 384 valid blocks per segment
- run background GC
Benefit: GC count and block movement count both decrease obviously:
- Before:
- Valid: 86
- Dirty: 1
- Prefree: 11
- Free: 6001 (6001)
GC calls: 162 (BG: 220)
- data segments : 160 (160)
- node segments : 2 (2)
Try to move 41454 blocks (BG: 41454)
- data blocks : 40960 (40960)
- node blocks : 494 (494)
IPU: 0 blocks
SSR: 0 blocks in 0 segments
LFS: 41364 blocks in 81 segments
- After:
- Valid: 87
- Dirty: 0
- Prefree: 4
- Free: 6008 (6008)
GC calls: 75 (BG: 76)
- data segments : 74 (74)
- node segments : 1 (1)
Try to move 12813 blocks (BG: 12813)
- data blocks : 12544 (12544)
- node blocks : 269 (269)
IPU: 0 blocks
SSR: 12032 blocks in 77 segments
LFS: 855 blocks in 2 segments
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix a bug along with pinfile in-mem segment & clean up]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
This switches f2fs over to the generic support provided in
the previous patch.
Since casefolded dentries behave the same in ext4 and f2fs, we decrease
the maintenance burden by unifying them, and any optimizations will
immediately apply to both.
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
refcount_t type variable should never be less than one, so it's a
little bit hard to understand when we use it to indicate pending
compressed page count, let's change to use atomic_t for better
readability.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
then, we can add specified entry into rb-tree with 64-bits segment time
as key.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Don't let f2fs inner GC ruins original aging degree of segment.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Previous implementation of aligned pinfile allocation will:
- allocate new segment on cold data log no matter whether last used
segment is partially used or not, it makes IOs more random;
- force concurrent cold data/GCed IO going into warm data area, it
can make a bad effect on hot/cold data separation;
In this patch, we introduce a new type of log named 'inmem curseg',
the differents from normal curseg is:
- it reuses existed segment type (CURSEG_XXX_NODE/DATA);
- it only exists in memory, its segno, blkofs, summary will not b
persisted into checkpoint area;
With this new feature, we can enhance scalability of log, special
allocators can be created for purposes:
- pure lfs allocator for aligned pinfile allocation or file
defragmentation
- pure ssr allocator for later feature
So that, let's update aligned pinfile allocation to use this new
inmem curseg fwk.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since DUMMY_WRITTEN_PAGE and ATOMIC_WRITTEN_PAGE have already been
converted as unsigned long type, we don't need do type casting again.
Signed-off-by: Xiaojun Wang <wangxiaojun11@huawei.com>
Reported-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
NVMe Zoned Namespace devices can have zone-capacity less than zone-size.
Zone-capacity indicates the maximum number of sectors that are usable in
a zone beginning from the first sector of the zone. This makes the sectors
sectors after the zone-capacity till zone-size to be unusable.
This patch set tracks zone-size and zone-capacity in zoned devices and
calculate the usable blocks per segment and usable segments per section.
If zone-capacity is less than zone-size mark only those segments which
start before zone-capacity as free segments. All segments at and beyond
zone-capacity are treated as permanently used segments. In cases where
zone-capacity does not align with segment size the last segment will start
before zone-capacity and end beyond the zone-capacity of the zone. For
such spanning segments only sectors within the zone-capacity are used.
During writes and GC manage the usable segments in a section and usable
blocks per segment. Segments which are beyond zone-capacity are never
allocated, and do not need to be garbage collected, only the segments
which are before zone-capacity needs to garbage collected.
For spanning segments based on the number of usable blocks in that
segment, write to blocks only up to zone-capacity.
Zone-capacity is device specific and cannot be configured by the user.
Since NVMe ZNS device zones are sequentially write only, a block device
with conventional zones or any normal block device is needed along with
the ZNS device for the metadata operations of F2fs.
A typical nvme-cli output of a zoned device shows zone start and capacity
and write pointer as below:
SLBA: 0x0 WP: 0x0 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ
SLBA: 0x20000 WP: 0x20000 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ
SLBA: 0x40000 WP: 0x40000 Cap: 0x18800 State: EMPTY Type: SEQWRITE_REQ
Here zone size is 64MB, capacity is 49MB, WP is at zone start as the zones
are in EMPTY state. For each zone, only zone start + 49MB is usable area,
any lba/sector after 49MB cannot be read or written to, the drive will fail
any attempts to read/write. So, the second zone starts at 64MB and is
usable till 113MB (64 + 49) and the range between 113 and 128MB is
again unusable. The next zone starts at 128MB, and so on.
Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since set/clear_inode_flag() don't need to return value to show
if flag is set, we can just call set/clear_bit() here.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Just for code style, no logic change
1. delete useless space
2. change spaces into tab
Signed-off-by: Jack Qiu <jack.qiu@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The return value of f2fs_flush_inline_data() is not used,
so delete it.
Signed-off-by: Jia Yang <jiayang5@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Added a new ioctl to send discard commands or/and zero out
to selected data area of a regular file for security reason.
The way of handling range.len of F2FS_IOC_SEC_TRIM_FILE:
1. Added -1 value support for range.len to secure trim the whole blocks
starting from range.start regardless of i_size.
2. If the end of the range passes over the end of file, it means until
the end of file (i_size).
3. ignored the case of that range.len is zero to prevent the function
from making end_addr zero and triggering different behaviour of
the function.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Don't define F2FS_IOC_* aliases to ioctls that already have a generic
FS_IOC_* name. These aliases are unnecessary, and they make it unclear
which ioctls are f2fs-specific and which are generic.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
- don't panic kernel if f2fs_get_node_page() fails in
f2fs_recover_inline_data() or f2fs_recover_inline_xattr();
- return error number of f2fs_truncate_blocks() to
f2fs_recover_inline_data()'s caller;
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Added a new gc_urgent mode, GC_URGENT_LOW, in which mode
F2FS will lower the bar of checking idle in order to
process outstanding discard commands and GC a little bit
aggressively.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>