Commit Graph

494902 Commits

Author SHA1 Message Date
Jaegeuk Kim 11504a8e7e f2fs: avoid write_checkpoint if f2fs is mounted readonly
Do not change any partition when f2fs is changed to readonly mode.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-02-11 17:04:40 -08:00
Jaegeuk Kim 2d834bf9ac f2fs: support norecovery mount option
This patch adds a mount option, norecovery, which is mostly same as
disable_roll_forward. The only difference is that norecovery should be activated
with read-only mount option.

This can be used when user wants to check whether f2fs is mountable or not
without any recovery process. (e.g., xfstests/200)

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-02-11 17:04:40 -08:00
Jaegeuk Kim dabc4a5c60 f2fs: fix not to drop mount options when retrying fill_super
If wrong mount option was requested, f2fs tries to fill_super again.
But, during the next trial, f2fs has no valid mount options, since
parse_options deleted all the separators in the original string.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-02-11 17:04:39 -08:00
Chao Yu caf0047e7e f2fs: merge flags in struct f2fs_sb_info
Currently, there are several variables with Boolean type as below:

struct f2fs_sb_info {
...
	int s_dirty;
	bool need_fsck;
	bool s_closing;
...
	bool por_doing;
...
}

For this there are some issues:
1. there are some space of f2fs_sb_info is wasted due to aligning after Boolean
   type variables by compiler.
2. if we continuously add new flag into f2fs_sb_info, structure will be messed
   up.

So in this patch, we try to:
1. switch s_dirty to Boolean type variable since it has two status 0/1.
2. merge s_dirty/need_fsck/s_closing/por_doing variables into s_flag.
3. introduce an enum type which can indicate different states of sbi.
4. use new introduced universal interfaces is_sbi_flag_set/{set,clear}_sbi_flag
   to operate flags for sbi.

After that, above issues will be fixed.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-02-11 17:04:38 -08:00
Chao Yu 88dd893419 f2fs: clean up {in,de}create_sleep_time
Use pointer parameter @wait to pass result in {in,de}create_sleep_time for
cleanup.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-02-11 17:04:37 -08:00
Chao Yu feeb0debfb f2fs: make truncate_inline_date static
1. make truncate_inline_date static;
2. remove parameter @from of truncate_inline_date as callers only pass zero.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-02-11 17:04:37 -08:00
Kinglong Mee 3b6709b771 f2fs: fix a bug of inheriting default ACL from parent
Introduced by a6dda0e63e
"f2fs: use generic posix ACL infrastructure".

When testing default acl, gets in recent kernel (3.19.0-rc5),
user::rwx
group::r-x
other::r-x
default:user::rwx
default:group::r-x
default:group:root:rwx
default😷:rwx
default:other::r-x

]# getfacl testdir/
user::rwx
group::rwx
                // missing an acl "group:root:rwx" inherited from parent
other::r-x
default:user::rwx
default:group::r-x
default:group:root:rwx
default😷:rwx
default:other::r-x

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-02-11 17:04:36 -08:00
Chao Yu f28e503429 f2fs: use f2fs_radix_tree_insert to clean codes
No modification in functionality, just clean codes with f2fs_radix_tree_insert.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-02-11 17:04:35 -08:00
Chao Yu d49f3e8902 f2fs: add F2FS_IOC_GETVERSION support
In this patch we add the FS_IOC_GETVERSION ioctl for getting i_generation from
inode, after that, users can list file's generation number by using "lsattr -v".

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-02-11 17:04:35 -08:00
Jaegeuk Kim bc4a1f873b f2fs: leave comment for code readability
During the recovery, any xattr blocks should not be found, since they are
written into cold log, not the warm node chain.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-02-11 17:04:34 -08:00
Chao Yu 1601839e9e f2fs: fix to release count of meta page in ->invalidatepage
We will encounter deadloop in below scenario:

1. increase page count for F2FS_DIRTY_META type in following path:
->recover_fsync_data
  ->recover_data
    ->do_recover_data
      ->recover_data_page
        ->change_curseg
          ->write_sum_page
            ->set_page_dirty
2. fail in recover_data()
3. invalidate meta pages in truncate_inode_pages_final without decreasing page
   count.
4. deadloop when sync_meta_pages as page count will always be non-zero.

message:
NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s!

 [<c1129a37>] pagevec_lookup_tag+0x27/0x30
 [<f0e774c7>] sync_meta_pages+0x87/0x160 [f2fs]
 [<f0e86dd9>] recover_fsync_data+0xeb9/0xf10 [f2fs]
 [<f0e75398>] f2fs_fill_super+0x888/0x980 [f2fs]
 [<c11733ca>] mount_bdev+0x16a/0x1a0
 [<f0e7180f>] f2fs_mount+0x1f/0x30 [f2fs]
 [<c1173da6>] mount_fs+0x36/0x170
 [<c118b6f5>] vfs_kern_mount+0x55/0xe0
 [<c118d63f>] do_mount+0x1df/0x9f0
 [<c118e110>] SyS_mount+0x70/0xb0
 [<c15a0c48>] sysenter_do_call+0x12/0x12

To avoid page count leak, let's add ->invalidatepage and ->releasepage in
f2fs_meta_aops as f2fs_node_aops to release meta page count correctly.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-02-11 17:04:33 -08:00
Jaegeuk Kim 85dc2f2c6c f2fs: do checkpoint when umount flag is not set
If the previous checkpoint was done without CP_UMOUNT flag, it needs to do
checkpoint with CP_UMOUNT for the next fast boot.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-02-11 17:04:33 -08:00
Jaegeuk Kim 30a5537f9a f2fs: trigger correct checkpoint during umount
This patch fixes to trigger checkpoint with umount flag when kill_sb was called.
In kill_sb, f2fs_sync_fs was finally called, but at this time, f2fs can't do
checkpoint with CP_UMOUNT.
After then, f2fs_put_super is not doing checkpoint, since it is not dirty.

So, this patch adds a flag to indicate f2fs_sync_fs is called during umount.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-02-11 17:04:32 -08:00
Jaegeuk Kim 6f0aacbc3c f2fs: update memory footprint information
This patch adds missing memory usages, and splits them in detail.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-02-11 17:04:31 -08:00
Chao Yu 9066c6a7eb f2fs: fix wrong memory footprint statistics in debugfs
Our value of memory footprint statistics showed in debugfs is not calculated
correctly. Fix it in this patch.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-02-11 17:04:31 -08:00
Jaegeuk Kim 871f599f4a f2fs: avoid infinite loop on cp_error
If cp_error is set, we should avoid all the infinite loop.
In f2fs_sync_file, there is a hole, and this patch fixes that.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-02-11 17:04:30 -08:00
kbuild test robot 08e4126e1e f2fs: pids_lock can be static
fs/f2fs/trace.c:19:12: sparse: symbol 'pids_lock' was not declared. Should it be static?

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:29 -08:00
Jaegeuk Kim 351f4fba84 f2fs: add f2fs_destroy_trace_ios to free radix tree
This patch removes radix tree after finishing tracing IOs.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:28 -08:00
Jaegeuk Kim c05086506f f2fs: add spin_lock to cover radix operations in IO tracer
This patch adds spin_lock to cover radix tree operations in IO tracer.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:28 -08:00
Jaegeuk Kim dd4e4b59b1 f2fs: add nat/sit entries into status
This patch adds NAT/SIT entry informations.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:28 -08:00
Jaegeuk Kim 7aed0d4537 f2fs: free radix_tree_nodes used by nat_set entries
In the normal case, the radix_tree_nodes are freed successfully.
But, when cp_error was detected, we should destroy them forcefully.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:28 -08:00
Jaegeuk Kim df1991391f f2fs: fix wrong unlock_page call
This patch removes wrongly called unlock_page.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:27 -08:00
Chao Yu 9e5ba77fdb f2fs: get rid of kzalloc in __recover_inline_status
We use kzalloc to allocate memory in __recover_inline_status, and use this
all-zero memory to check the inline date content of inode page by comparing
them. This is low effective and not needed, let's check inline date content
directly.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: make the code more neat]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:27 -08:00
Jaegeuk Kim 38aa0889b2 f2fs: align direct_io'ed data to section
This patch aligns the start block address of a file for direct io to the f2fs's
section size.

Some flash devices manage an over 4KB-sized page as a write unit, and if the
direct_io'ed data are written but not aligned to that unit, the performance can
be degraded due to the partial page copies.

Thus, since f2fs has a section that is well aligned to FTL units, we can align
the block address to the section size so that f2fs avoids this misalignment.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:27 -08:00
Jaegeuk Kim 41ef94b35c f2fs: remove uncovered code path
This patch removes unnecessary function calls.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:27 -08:00
Jaegeuk Kim 3547ea961d f2fs: avoid potential unnecessary codes
This patch relocates some operations to avoid unnecessary execution.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:26 -08:00
Jaegeuk Kim e1509cf294 f2fs: clean up to remove parameter
This patch uses dn->data_blkaddr as a parameter for the destination block
address.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:26 -08:00
Chao Yu 062920734c f2fs: reuse inode_entry_slab in gc procedure for using slab more effectively
There are two slab cache inode_entry_slab and winode_slab using the same
structure as below:

struct dir_inode_entry {
	struct list_head list;	/* list head */
	struct inode *inode;	/* vfs inode pointer */
};

struct inode_entry {
	struct list_head list;
	struct inode *inode;
};

It's a little waste that the two cache can not share their memory space for each
other.
So in this patch we remove one redundant winode_slab slab cache, then use more
universal name struct inode_entry as remaining data structure name of slab,
finally we reuse the inode_entry_slab to store dirty dir item and gc item for
more effective.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:26 -08:00
Chao Yu 2ace38e00e f2fs: cleanup parameters for trace_f2fs_submit_{read_,write_,page_,page_m}bio with fio
Cleanup parameters for trace_f2fs_submit_{read_,write_,page_,page_m}bio with fio
as one parameter.

Suggested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:26 -08:00
Chao Yu 3e1c8f125e f2fs: cleanup trace event of f2fs_submit_page_{m,}bio with DECLARE_EVENT_CLASS
This patch adds missing parameter _type_ for trace_f2fs_submit_page_bio, then
use DECLARE_EVENT_CLASS/DEFINE_EVENT_CONDITION pair to cleanup some trace event
code related to f2fs_submit_page_{m,}bio.

Additionally, after we remove redundant code, size of code can be reduced:
   text    data     bss     dec     hex filename
 176787    8712      56  185555   2d4d3 f2fs.ko.org
 174408    8648      56  183112   2cb48 f2fs.ko

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:25 -08:00
Jaegeuk Kim 09eb483e89 f2fs: fix missing cold bit during recovery
In do_recover_data, we find and update previous node pages after updating
its new block addresses.
After then, we call fill_node_footer without reset field, we erase its
cold bit so that this new cold node block is written to wrong log area.
This patch fixes not to miss its old flag.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:25 -08:00
Changman Lee b9a2c25207 f2fs: add block count by in-place-update in stat info
This patch adds block count by in-place-update in stat.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:25 -08:00
Jaegeuk Kim dd802406e3 f2fs: avoid double lock for cp_rwsem
The __f2fs_add_link is covered by cp_rwsem all the time.
This calls init_inode_metadata, which conducts some acl operations including
memory allocation with GFP_KERNEL previously.
But, under memory pressure, f2fs_write_data_page can be called, which also
grabs cp_rwsem too.

In this case, this incurs a deadlock pointed by Chao.
Thread #1        Thread #2
 down_read
                 down_write
  down_read
 -> here down_read should wait forever.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:25 -08:00
Jaegeuk Kim db9f7c1a95 f2fs: activate f2fs_trace_ios
This patch activates f2fs_trace_ios.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:24 -08:00
Jaegeuk Kim 9e4ded3f30 f2fs: activate f2fs_trace_pid
This patch activates f2fs_trace_pid.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:24 -08:00
Jaegeuk Kim 0e689d0369 f2fs: add key functions for f2fs_io_tracer
This patch adds two key functions to trace process ids and IOs.
The basic idea is to
1. remain process ids, pids, in page->private.
2. show pids in IO traces.

So, later we can retrieve process information according to IO traces.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:24 -08:00
Jaegeuk Kim 63f92ddc8a f2fs: add f2fs_io_tracer support
This patch adds:
 o initial trace.c and trace.h with skeleton functions
 o Kconfig and Makefile to activate this feature

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:24 -08:00
Jaegeuk Kim cf04e8eb55 f2fs: use f2fs_io_info to clean up messy parameters during IO path
This patch cleans up parameters on IO paths.
The key idea is to use f2fs_io_info adding a parameter, block address, and then
use this structure as parameters.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:23 -08:00
Chao Yu 9ecf4b80bd f2fs: use ra_meta_pages to simplify readahead code in restore_node_summary
Use more common function ra_meta_pages() with META_POR to readahead node blocks
in restore_node_summary() instead of ra_sum_pages(), hence we can simplify the
readahead code there, and also we can remove unused function ra_sum_pages().

changes from v2:
 o use invalidate_mapping_pages as before suggested by Changman Lee.
changes from v1:
 o fix one bug when using truncate_inode_pages_range which is pointed out by
   Jaegeuk Kim.

Reviewed-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:23 -08:00
Chao Yu 5c27f4ee44 f2fs: merge two uchar variable in struct node_info to reduce memory cost
This patch moves one member of struct nat_entry: _flag_ to struct node_info,
so _version_ in struct node_info and _flag_ which are unsigned char type will
merge to one 32-bit space in register/memory. So the size of nat_entry will be
reduced from 28 bytes to 24 bytes (for 64-bit machine, reduce its size from 40
bytes to 32 bytes) and then slab memory using by f2fs will be reduced.

changes from v2:
 o update description of memory usage gain for 64-bit machine suggested by
   Changman Lee.
changes from v1:
 o introduce inline copy_node_info() to copy valid data from node info suggested
   by Jaegeuk Kim, it can avoid bug.

Reviewed-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:23 -08:00
Chao Yu 3fa06d7bc9 f2fs: readahead contiguous current summary blocks in checkpoint
Let's add readahead code for reading contiguous compact/normal summary blocks
in checkpoint, then we will gain better performance in mount procedure.

Changes from v1
  o remove inappropriate 'unlikely' in npages_for_summary_flush.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:23 -08:00
Jaegeuk Kim 5df1f1da7a f2fs: use missing the use of f2fs_kunmap_page
This patch calls f2fs_kunmap_page which I missed before.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:22 -08:00
Jaegeuk Kim 042b7816aa f2fs: remove unnecessary call to invalidate inmemory pages
Now we use inmemory pages for atomic write only and provide abort procedure,
we don't need to truncate them explicitly.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:22 -08:00
Jaegeuk Kim d7bc2484b8 f2fs: fix small discards not to issue redundantly
The ckpt_valid_map and cur_valid_map are synced by seg_info_to_raw_sit.

In the case of small discards, the candidates are selected before sync,
while fitrim selects candidates after sync.

So, for small discards, we need to add candidates only just being obsoleted.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:22 -08:00
Jaegeuk Kim 1e84371ffe f2fs: change atomic and volatile write policies
This patch adds two new ioctls to release inmemory pages grabbed by atomic
writes.
 o f2fs_ioc_abort_volatile_write
  - If transaction was failed, all the grabbed pages and data should be written.
 o f2fs_ioc_release_volatile_write
  - This is to enhance the performance of PERSIST mode in sqlite.

In order to avoid huge memory consumption which causes OOM, this patch changes
volatile writes to use normal dirty pages, instead blocked flushing to the disk
as long as system does not suffer from memory pressure.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:22 -08:00
Jaegeuk Kim 70c640b1d6 f2fs: don't need to call lock_op and lock_page for abort
We don't need to call lock_op and lock_page at the aborting path.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:21 -08:00
Jaegeuk Kim 88a70a69c0 f2fs: fix wrong condition check to trigger f2fs_sync_fs
If there is not enough available memory, we need to trigger f2fs_sync_fs.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:21 -08:00
Jaegeuk Kim cd52b6368f f2fs: remove checking dirty_exceed
We don't need to force to write dirty_exceeded for f2fs_balance_fs_bg.
This flag was only meaningful to write bypassing conditions.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:21 -08:00
Linus Torvalds b3d574aec7 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "12 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed
  memcg: fix destination cgroup leak on task charges migration
  mm: memcontrol: switch soft limit default back to infinity
  mm/debug_pagealloc: remove obsolete Kconfig options
  vfs: renumber FMODE_NONOTIFY and add to uniqueness check
  arch/blackfin/mach-bf533/boards/stamp.c: add linux/delay.h
  ocfs2: fix the wrong directory passed to ocfs2_lookup_ino_from_name() when link file
  MAINTAINERS: update rydberg's addresses
  mm: protect set_page_dirty() from ongoing truncation
  mm: prevent endless growth of anon_vma hierarchy
  exit: fix race between wait_consider_task() and wait_task_zombie()
  ocfs2: remove bogus check in dlm_process_recovery_data
2015-01-09 15:10:59 -08:00
Vlastimil Babka 9e5e366172 mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed
Charles Shirron and Paul Cassella from Cray Inc have reported kswapd
stuck in a busy loop with nothing left to balance, but
kswapd_try_to_sleep() failing to sleep.  Their analysis found the cause
to be a combination of several factors:

1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait

2. The process has been killed (by OOM in this case), but has not yet been
   scheduled to remove itself from the waitqueue and die.

3. kswapd checks for throttled processes in prepare_kswapd_sleep():

        if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
                wake_up(&pgdat->pfmemalloc_wait);
		return false; // kswapd will not go to sleep
	}

   However, for a process that was already killed, wake_up() does not remove
   the process from the waitqueue, since try_to_wake_up() checks its state
   first and returns false when the process is no longer waiting.

4. kswapd is running on the same CPU as the only CPU that the process is
   allowed to run on (through cpus_allowed, or possibly single-cpu system).

5. CONFIG_PREEMPT_NONE=y kernel is used. If there's nothing to balance, kswapd
   encounters no voluntary preemption points and repeatedly fails
   prepare_kswapd_sleep(), blocking the process from running and removing
   itself from the waitqueue, which would let kswapd sleep.

So, the source of the problem is that we prevent kswapd from going to
sleep until there are processes waiting on the pfmemalloc_wait queue,
and a process waiting on a queue is guaranteed to be removed from the
queue only when it gets scheduled.  This was done to make sure that no
process is left sleeping on pfmemalloc_wait when kswapd itself goes to
sleep.

However, it isn't necessary to postpone kswapd sleep until the
pfmemalloc_wait queue actually empties.  To prevent processes from being
left sleeping, it's actually enough to guarantee that all processes
waiting on pfmemalloc_wait queue have been woken up by the time we put
kswapd to sleep.

This patch therefore fixes this issue by substituting 'wake_up' with
'wake_up_all' and removing 'return false' in the code snippet from
prepare_kswapd_sleep() above.  Note that if any process puts itself in
the queue after this waitqueue_active() check, or after the wake up
itself, it means that the process will also wake up kswapd - and since
we are under prepare_to_wait(), the wake up won't be missed.  Also we
update the comment prepare_kswapd_sleep() to hopefully more clearly
describe the races it is preventing.

Fixes: 5515061d22 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org>	[3.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-08 15:10:52 -08:00