When considering a bunch of data writes with very frequent fsync calls, we
are able to think the following performance regression.
N: Node IO, D: Data IO, IO scheduler: cfq
Issue pending IOs
D1 D2 D3 D4
D1 D2 D3 D4 N1
D2 D3 D4 N1 N2
N1 D3 D4 N2 D1
--> N1 can be selected by cfq becase of the same priority of N and D.
Then D3 and D4 would be delayed, resuling in performance degradation.
So, when processing the fsync call, it'd better give higher priority to data IOs
than node IOs by assigning WRITE and WRITE_SYNC respectively.
This patch improves the random wirte performance with frequent fsync calls by up
to 10%.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds a inline_data recovery routine with the following policy.
[prev.] [next] of inline_data flag
o o -> recover inline_data
o x -> remove inline_data, and then recover data blocks
x o -> remove inline_data, and then recover inline_data
x x -> recover data blocks
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds the number of inline_data files into the status information.
Note that the number is reset whenever the filesystem is newly mounted.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Change log from v1:
o handle NULL pointer of grab_cache_page_write_begin() pointed by Chao Yu.
This patch refactors f2fs_convert_inline_data to check a couple of conditions
internally for deciding whether it needs to convert inline_data or not.
So, the new f2fs_convert_inline_data initially checks:
1) f2fs_has_inline_data(), and
2) the data size to be changed.
If the inode has inline_data but the size to fill is less than MAX_INLINE_DATA,
then we don't need to convert the inline_data with data allocation.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Functions to implement inline data read/write, and move inline data to
normal data block when file size exceeds inline data limitation.
Signed-off-by: Huajun Li <huajun.li@intel.com>
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Weihong Xu <weihong.xu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, we need to calculate the max orphan num when we try to acquire an
orphan inode, but it's a stable value since the super block was inited. So
converting it to a field of f2fs_sb_info and use it directly when needed seems
a better choose.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces F2FS_INODE that returns struct f2fs_inode * from the inode
page.
By using this macro, we can remove unnecessary casting codes like below.
struct f2fs_inode *ri = &F2FS_NODE(inode_page)->i;
-> struct f2fs_inode *ri = F2FS_INODE(inode_page);
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Update several comments:
1. use f2fs_{un}lock_op install of mutex_{un}lock_op.
2. update comment of get_data_block().
3. update description of node offset.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
When using the f2fs_io_info in the low level, we still need to merge the
rw and rw_flag, so use the rw to hold all the io flags directly,
and remove the rw_flag field.
ps.It is based on the previous patch:
f2fs: move all the bio initialization into __bio_alloc
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, f2fs doesn't support direct IOs with high performance, which throws
every write requests via the buffered write path, resulting in highly
performance degradation due to memory opeations like copy_from_user.
This patch introduces a new direct IO path in which every write requests are
processed by generic blockdev_direct_IO() with enhanced get_block function.
The get_data_block() in f2fs handles:
1. if original data blocks are allocates, then give them to blockdev.
2. otherwise,
a. preallocate requested block addresses
b. do not use extent cache for better performance
c. give the block addresses to blockdev
This policy induces that:
- new allocated data are sequentially written to the disk
- updated data are randomly written to the disk.
- f2fs gives consistency on its file meta, not file data.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces new sysfs entries for users to control the policy of
in-place-updates, namely IPU, in f2fs.
Sometimes f2fs suffers from performance degradation due to its out-of-place
update policy that produces many additional node block writes.
If the storage performance is very dependant on the amount of data writes
instead of IO patterns, we'd better drop this out-of-place update policy.
This patch suggests 5 polcies and their triggering conditions as follows.
[sysfs entry name = ipu_policy]
0: F2FS_IPU_FORCE all the time,
1: F2FS_IPU_SSR if SSR mode is activated,
2: F2FS_IPU_UTIL if FS utilization is over threashold,
3: F2FS_IPU_SSR_UTIL if SSR mode is activated and FS utilization is over
threashold,
4: F2FS_IPU_DISABLE disable IPU. (=default option)
[sysfs entry name = min_ipu_util]
This parameter controls the threshold to trigger in-place-updates.
The number indicates percentage of the filesystem utilization, and used by
F2FS_IPU_UTIL and F2FS_IPU_SSR_UTIL policies.
For more details, see need_inplace_update() in segment.h.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces f2fs_io_info to mitigate the complex parameter list.
struct f2fs_io_info {
enum page_type type; /* contains DATA/NODE/META/META_FLUSH */
int rw; /* contains R/RS/W/WS */
int rw_flag; /* contains REQ_META/REQ_PRIO */
}
1. f2fs_write_data_pages
- DATA
- WRITE_SYNC is set when wbc->WB_SYNC_ALL.
2. sync_node_pages
- NODE
- WRITE_SYNC all the time
3. sync_meta_pages
- META
- WRITE_SYNC all the time
- REQ_META | REQ_PRIO all the time
** f2fs_submit_merged_bio() handles META_FLUSH.
4. ra_nat_pages, ra_sit_pages, ra_sum_pages
- META
- READ_SYNC
Cc: Fan Li <fanofcode.li@samsung.com>
Cc: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously f2fs submits most of write requests using WRITE_SYNC, but f2fs_write_data_pages
submits last write requests by sync_mode flags callers pass.
This causes a performance problem since continuous pages with different sync flags
can't be merged in cfq IO scheduler(thanks yu chao for pointing it out), and synchronous
requests often take more time.
This patch makes the following modifies to DATA writebacks:
1. every page will be written back using the sync mode caller pass.
2. only pages with the same sync mode can be merged in one bio request.
These changes are restricted to DATA pages.Other types of writebacks are modified
To remain synchronous.
In my test with tiotest, f2fs sequence write performance is improved by about 7%-10% ,
and this patch has no obvious impact on other performance tests.
Signed-off-by: Fan Li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
As we know, some of our branch condition will rarely be true. So we could add
'unlikely' to let compiler optimize these code, by this way we could drop
unneeded 'jump' assemble code to improve performance.
change log:
o add *unlikely* as many as possible across the whole source files at once
suggested by Jaegeuk Kim.
Suggested-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch integrates redundant bio operations on read and write IOs.
1. Move bio-related codes to the top of data.c.
2. Replace f2fs_submit_bio with f2fs_submit_merged_bio, which handles read
bios additionally.
3. Introduce __submit_merged_bio to submit the merged bio.
4. Change f2fs_readpage to f2fs_submit_page_bio.
5. Introduce f2fs_submit_page_mbio to integrate previous submit_read_page and
submit_write_page.
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com >
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The recover_orphan_inodes() returns no error all the time, so we don't need to
check its errors.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: add description]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch removes the unnecessary condition checks on:
fs/f2fs/gc.c:667 do_garbage_collect() warn: 'sum_page' isn't an ERR_PTR
fs/f2fs/f2fs.h:795 f2fs_put_page() warn: 'page' isn't an ERR_PTR
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Add new inode flags F2FS_INLINE_DATA and FI_INLINE_DATA to indicate
whether the inode has inline data.
Inline data makes use of inode block's data indices region to save small
file. Currently there are 923 data indices in an inode block. Since
inline xattr has made use of the last 50 indices to save its data, there
are 873 indices left which can be used for inline data. When
FI_INLINE_DATA is set, the layout of inode block's indices region is
like below:
+-----------------+
| | Reserved. reserve_new_block() will make use of
| i_addr[0] | i_addr[0] when we need to reserve a new data block
| | to convert inline data into regular one's.
|-----------------|
| | Used by inline data. A file whose size is less than
| i_addr[1~872] | 3488 bytes(~3.4k) and doesn't reserve extra
| | blocks by fallocate() can be saved here.
|-----------------|
| |
| i_addr[873~922] | Reserved for inline xattr
| |
+-----------------+
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Huajun Li <huajun.li@intel.com>
Signed-off-by: Weihong Xu <weihong.xu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Add the function f2fs_reserve_block() to easily reserve new blocks, and
use it to clean up more codes.
Signed-off-by: Huajun Li <huajun.li@intel.com>
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Weihong Xu <weihong.xu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
For better read performance, we add a new function to support for merging
contiguous read as the one for write.
v1-->v2:
o add declarations here as Gu Zheng suggested.
o use new structure f2fs_bio_info introduced by Jaegeuk Kim.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Acked-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
The f2fs has three bio types, NODE, DATA, and META, and manages some data
structures per each bio types.
The codes are a little bit messy, thus, this patch introduces a bio array
which groups individual data structures as follows.
struct f2fs_bio_info {
struct bio *bio; /* bios to merge */
sector_t last_block_in_bio; /* last block number */
struct mutex io_mutex; /* mutex for bio */
};
struct f2fs_sb_info {
...
struct f2fs_bio_info write_io[NR_PAGE_TYPE]; /* for write bios */
...
};
The code changes from this new data structure are trivial.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The f2fs manages an extent cache to search a number of consecutive data blocks
very quickly.
However it conducts unnecessary cache operations if the file is highly
fragmented with no valid extent cache.
In such the case, we don't need to handle the extent cache, but just can disable
the cache facility.
Nevertheless, this patch gives one more chance to enable the extent cache.
For example,
1. create a file
2. write data sequentially which produces a large valid extent cache
3. update some data, resulting in a fragmented extent
4. if the fragmented extent is too small, then drop extent cache
5. close the file
6. open the file again
7. give another chance to make a new extent cache
8. write data sequentially again which creates another big extent cache.
...
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch removes an unnecessary semaphore (i.e., sbi->bio_sem).
There is no reason to use the semaphore when f2fs submits read and write IOs.
Instead, let's use a write mutex and cover the sbi->bio[] by the lock.
Change log from v1:
o split write_mutex suggested by Chao Yu
Chao described,
"All DATA/NODE/META bio buffers in superblock is protected by
'sbi->write_mutex', but each bio buffer area is independent, So we
should split write_mutex to three for DATA/NODE/META."
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds a slab cache entry for small discards.
Each entry consists of:
struct discard_entry {
struct list_head list; /* list head */
block_t blkaddr; /* block address to be discarded */
int len; /* # of consecutive blocks of the discard */
};
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
use genernal method supported by kernel
o changes from v1
If any waiter exists at end io, wake up it.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, f2fs_sync_file() waits for all the node blocks to be written.
But, we don't need to do that, but wait only the inode-related node blocks.
This patch adds wait_on_node_pages_writeback() in which waits inode-related
node blocks that are on writeback.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If you want to remove unnecessary BUG_ONs, you can just turn off F2FS_CHECK_FS
in your kernel config.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch merges some background jobs into this new function.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, f2fs postpones reclaiming prefree segments into free segments
as much as possible.
However, if user writes and deletes a bunch of data without any sync or fsync
calls, some flash storages can suffer from garbage collections.
So, this patch adds the reclaiming codes to f2fs_write_node_pages and background
GC thread.
If there are a lot of prefree segments, let's do checkpoint so that f2fs
submits discard commands for the prefree regions to the flash storage.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Introduce the unfailed version of kmem_cache_alloc named f2fs_kmem_cache_alloc
to hide the retry routine and make the code a bit cleaner.
v2:
Fix the wrong use of 'retry' tag pointed out by Gao feng.
Use more neat code to remove redundant tag suggested by Haicheng Li.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, do_checkpoint() will call congestion_wait() for waiting the pages
(previous submitted node/meta/data pages) to be written back.
Because congestion_wait() will set a regular period (e.g. HZ / 50 ) for waiting, and
no additional wake up mechanism was introduced if IO ends up before regular period costed.
Yuan Zhong found there is a situation that after the pages have been written back,
but the checkpoint thread still wait for congestion_wait to exit.
So here we store checkpoint task into f2fs_sb when doing checkpoint, it'll wait for IO completes
if there's IO going on, and in the end IO path, wake up checkpoint task when IO ends up.
Thanks to Yuan Zhong's pre work about this problem.
Reported-by: Yuan Zhong <yuan.mark.zhong@samsung.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch removes the logic previously introduced to address the starvation
on cp_rwsem.
One potential there-in bug is that we should cover the wait.list with spin_lock,
but the previous code broke this rule.
And, actually current rwsem handles this starvation issue reasonably, so that we
didn't need to do this before neither.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The fs_locks is used to block other ops(ex, recovery) when doing checkpoint.
And each other operate routine(besides checkpoint) needs to acquire a fs_lock,
there is a terrible problem here, if these are too many concurrency threads acquiring
fs_lock, so that they will block each other and may lead to some performance problem,
but this is not the phenomenon we want to see.
Though there are some optimization patches introduced to enhance the usage of fs_lock,
but the thorough solution is using a *rw_sem* to replace the fs_lock.
Checkpoint routine takes write_sem, and other ops take read_sem, so that we can block
other ops(ex, recovery) when doing checkpoint, and other ops will not disturb each other,
this can avoid the problem described above completely.
Because of the weakness of rw_sem, the above change may introduce a potential problem
that the checkpoint thread might get starved if other threads are intensively locking
the read semaphore for I/O.(Pointed out by Xu Jin)
In order to avoid this, a wait_list is introduced, the appending read semaphore ops
will be dropped into the wait_list if checkpoint thread is waiting for write semaphore,
and will be waked up when checkpoint thread gives up write semaphore.
Thanks to Kim's previous review and test, and will be very glad to see other guys'
performance tests about this patch.
V2:
-fix the potential starvation problem.
-use more suitable func name suggested by Xu Jin.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
[Jaegeuk Kim: adjust minor coding standard]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
There is a performance problem: when all sbi->fs_lock are holded, then
all the following threads may get the same next_lock value from sbi->next_lock_num
in function mutex_lock_op, and wait for the same lock(fs_lock[next_lock]),
it may cause performance reduce.
So we move the sbi->next_lock_num++ before getting lock, this will average the
following threads if all sbi->fs_lock are holded.
v1-->v2:
Drop the needless spin_lock as Jaegeuk suggested.
Suggested-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: Yu Chao <chao2.yu@samsung.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
0. modified inode structure
--------------------------------------
metadata (e.g., i_mtime, i_ctime, etc)
--------------------------------------
direct pointers [0 ~ 873]
inline xattrs (200 bytes by default)
indirect pointers [0 ~ 4]
--------------------------------------
node footer
--------------------------------------
1. setxattr flow
- read_all_xattrs copies all the xattrs from inline and xattr node block.
- handle xattr entries
- write_all_xattrs copies modified xattrs into inline and xattr node block.
2. getxattr flow
- read_all_xattrs copies all the xattrs from inline and xattr node block.
- check target entries
3. Usage
# mount -t f2fs -o inline_xattr $DEV $MNT
Once mounted with the inline_xattr option, f2fs marks all the newly created
files to reserve an amount of inline xattr space explicitly inside the inode
block. Without the mount option, f2fs will not touch any existing files and
newly created files as well.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch enables the number of direct pointers inside on-disk inode block to
be changed dynamically according to the size of inline xattr space.
The number of direct pointers, ADDRS_PER_INODE, can be changed only if the file
has inline xattr flag.
The number of direct pointers that will be used by inline xattrs is defined as
F2FS_INLINE_XATTR_ADDRS.
Current patch assigns F2FS_INLINE_XATTR_ADDRS to 0 temporarily.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds basic inode flags for inline xattrs, F2FS_INLINE_XATTR,
and add a mount option, inline_xattr, which is enabled when xattr is set.
If the mount option is enabled, all the files are marked with the inline_xattrs
flag.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously xattr node blocks are stored to the COLD_NODE log, which means that
our roll-forward mechanism doesn't recover the xattr node blocks at all.
Only the direct node blocks in the WARM_NODE log can be recovered.
So, let's resolve the issue simply by conducting checkpoint during fsync when a
file has a modified xattr node block.
This approach is able to degrade the performance, but normally the checkpoint
overhead is shown at the initial fsync call after the xattr entry changes.
Once the checkpoint is done, no additional overhead would be occurred.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch fixes the use of XATTR_NODE_OFFSET.
o The offset should not use several MSB bits which are used by marking node
blocks.
o IS_DNODE should handle XATTR_NODE_OFFSET to avoid potential abnormality
during the fsync call.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch should resolve the following error reported by kbuild test robot.
All error/warnings:
In file included from fs/f2fs/dir.c:13:0:
>> fs/f2fs/f2fs.h:435:17: error: field 's_kobj' has incomplete type
struct kobject s_kobj;
The failure was caused by missing the kobject header file in dir.c.
So, this patch move the header file to the right location, f2fs.h.
CC: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>