jbd2_alloc is explicit about its allocation preferences wrt. the
allocation size. Sub page allocations go to the slab allocator and
larger are using either the page allocator or vmalloc. This is all good
but the logic is unnecessarily complex.
1) as per Ted, the vmalloc fallback is a left-over:
: jbd2_alloc is only passed in the bh->b_size, which can't be PAGE_SIZE, so
: the code path that calls vmalloc() should never get called. When we
: conveted jbd2_alloc() to suppor sub-page size allocations in commit
: d2eecb0393, there was an assumption that it could be called with a size
: greater than PAGE_SIZE, but that's certaily not true today.
Moreover vmalloc allocation might even lead to a deadlock because the
callers expect GFP_NOFS context while vmalloc is GFP_KERNEL.
2) __GFP_REPEAT for requests <= PAGE_ALLOC_COSTLY_ORDER is ignored
since the flag was introduced.
Let's simplify the code flow and use the slab allocator for sub-page
requests and the page allocator for others. Even though order > 0 is
not currently used as per above leave that option open.
Link: http://lkml.kernel.org/r/1464599699-30131-18-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
after a crash and a potential BUG_ON crash if a file has the data
journalling flag enabled while it has dirty delayed allocation blocks
that haven't been written yet. Also fix a potential crash in the new
project quota code and a maliciously corrupted file system.
In addition, fix some DAX-specific bugs, including when there is a
transient ENOSPC situation and races between writes via direct I/O and
an mmap'ed segment that could lead to lost I/O.
Finally the usual set of miscellaneous cleanups.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJXQ40fAAoJEPL5WVaVDYGjnwMH+wXHASgPfzZgtRInsTG8W/2L
jsmAcMlyMAYIATWMppNtPIq0td49z1dYO0YkKhtPVMwfzu230IFWhGWp93WqP9ve
XYHMmaBorFlMAzWgMKn1K0ExWZlV+ammmcTKgU0kU4qyZp0G/NnMtlXIkSNv2amI
9Mn6R+v97c20gn8e9HWP/IVWkgPr+WBtEXaSGjC7dL6yI8hL+rJMqN82D76oU5ea
vtwzrna/ISijy+etYmQzqHNYNaBKf40+B5HxQZw/Ta3FSHofBwXAyLaeEAr260Mf
V3Eg2NDcKQxiZ3adBzIUvrRnrJV381OmHoguo8Frs8YHTTRiZ0T/s7FGr2Q0NYE=
=7yIM
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Fix a number of bugs, most notably a potential stale data exposure
after a crash and a potential BUG_ON crash if a file has the data
journalling flag enabled while it has dirty delayed allocation blocks
that haven't been written yet. Also fix a potential crash in the new
project quota code and a maliciously corrupted file system.
In addition, fix some DAX-specific bugs, including when there is a
transient ENOSPC situation and races between writes via direct I/O and
an mmap'ed segment that could lead to lost I/O.
Finally the usual set of miscellaneous cleanups"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
ext4: pre-zero allocated blocks for DAX IO
ext4: refactor direct IO code
ext4: fix race in transient ENOSPC detection
ext4: handle transient ENOSPC properly for DAX
dax: call get_blocks() with create == 1 for write faults to unwritten extents
ext4: remove unmeetable inconsisteny check from ext4_find_extent()
jbd2: remove excess descriptions for handle_s
ext4: remove unnecessary bio get/put
ext4: silence UBSAN in ext4_mb_init()
ext4: address UBSAN warning in mb_find_order_for_block()
ext4: fix oops on corrupted filesystem
ext4: fix check of dqget() return value in ext4_ioctl_setproject()
ext4: clean up error handling when orphan list is corrupted
ext4: fix hang when processing corrupted orphaned inode list
ext4: remove trailing \n from ext4_warning/ext4_error calls
ext4: fix races between changing inode journal mode and ext4_writepages
ext4: handle unwritten or delalloc buffers before enabling data journaling
ext4: fix jbd2 handle extension in ext4_ext_truncate_extend_restart()
ext4: do not ask jbd2 to write data for delalloc buffers
jbd2: add support for avoiding data writes during transaction commits
...
Currently when filesystem needs to make sure data is on permanent
storage before committing a transaction it adds inode to transaction's
inode list. During transaction commit, jbd2 writes back all dirty
buffers that have allocated underlying blocks and waits for the IO to
finish. However when doing writeback for delayed allocated data, we
allocate blocks and immediately submit the data. Thus asking jbd2 to
write dirty pages just unnecessarily adds more work to jbd2 possibly
writing back other redirtied blocks.
Add support to jbd2 to allow filesystem to ask jbd2 to only wait for
outstanding data writes before committing a transaction and thus avoid
unnecessary writes.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch fix spelling typos found in DocBook/filesystem.xml.
It is because the file was generated from comments in code,
I have to fix the comments in codes, instead of xml file.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Journal transaction might fail prematurely because the frozen_buffer
is allocated by GFP_NOFS request:
[ 72.440013] do_get_write_access: OOM for frozen_buffer
[ 72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory
(...snipped....)
[ 72.495559] do_get_write_access: OOM for frozen_buffer
[ 72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.496839] do_get_write_access: OOM for frozen_buffer
[ 72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.505766] Aborting journal on device sda1-8.
[ 72.505851] EXT4-fs (sda1): Remounting filesystem read-only
This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS
allocations upon OOM" because small GPF_NOFS allocations never failed.
This allocation seems essential for the journal and GFP_NOFS is too
restrictive to the memory allocator so let's use __GFP_NOFAIL here to
emulate the previous behavior.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
On umount path, jbd2_journal_destroy() writes latest transaction ID
(->j_tail_sequence) to be used at next mount.
The bug is that ->j_tail_sequence is not holding latest transaction ID
in some cases. So, at next mount, there is chance to conflict with
remaining (not overwritten yet) transactions.
mount (id=10)
write transaction (id=11)
write transaction (id=12)
umount (id=10) <= the bug doesn't write latest ID
mount (id=10)
write transaction (id=11)
crash
mount
[recovery process]
transaction (id=11)
transaction (id=12) <= valid transaction ID, but old commit
must not replay
Like above, this bug become the cause of recovery failure, or FS
corruption.
So why ->j_tail_sequence doesn't point latest ID?
Because if checkpoint transactions was reclaimed by memory pressure
(i.e. bdev_try_to_free_page()), then ->j_tail_sequence is not updated.
(And another case is, __jbd2_journal_clean_checkpoint_list() is called
with empty transaction.)
So in above cases, ->j_tail_sequence is not pointing latest
transaction ID at umount path. Plus, REQ_FLUSH for checkpoint is not
done too.
So, to fix this problem with minimum changes, this patch updates
->j_tail_sequence, and issue REQ_FLUSH. (With more complex changes,
some optimizations would be possible to avoid unnecessary REQ_FLUSH
for example though.)
BTW,
journal->j_tail_sequence =
++journal->j_transaction_sequence;
Increment of ->j_transaction_sequence seems to be unnecessary, but
ext3 does this.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Currently we used atomic bit operations to manipulate
__JI_COMMIT_RUNNING bit. However this is unnecessary as i_flags are
always written and read under j_list_lock. So just change the operations
to standard bit operations.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Revoke and tag descriptor blocks are just different kinds of descriptor
blocks and thus have checksum in the same place. Unify computation and
checking of checksums for these.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Descriptor block header is initialized in several places. Factor out the
common code into jbd2_journal_get_descriptor_buffer().
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
jbd2_journal_write_revoke_records() takes journal pointer and write_op,
although journal can be obtained from the passed transaction and
write_op is always WRITE_SYNC. Remove these superfluous arguments.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
some endian conversion problems with ext4 encryption, potential memory
leaks after truncate in data=journal mode, and an ocfs2 regression
caused by a jbd2 performance improvement.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJWZP5/AAoJEPL5WVaVDYGjLUwH+wZUMghNAiR+AUEDW5MwbRMd
XGvPFr1ohs9ep6vrayRLAVU+BDbcSvL7GRddFpUabTfDqH6tyk43IqOFaE2UMefj
+k61vUnEaD7hTOGtgnfkntAcrE8hngTA1UkEGRTRQjRSqKgt1ku2LB/GfGbgv4Yq
1grwLbq/MRZ6gSZIv8TwuLC7BOAt6mLtxtJB8ozRuhVJVo1gzy1IvXkqn2mgf/h4
nIq6i720NxZS9HqncA/o1rxfWb2bwEpLj5MYqdgscwHBGql0riHM0HOebAbgvVqV
FgLM81p4qVry8N4ACNLIJsqjWu3eMcNUZFW3vaObGTg2tODTj8WSEClb56Z6fpE=
=iOzG
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Ext4 bug fixes for v4.4, including fixes for post-2038 time encodings,
some endian conversion problems with ext4 encryption, potential memory
leaks after truncate in data=journal mode, and an ocfs2 regression
caused by a jbd2 performance improvement"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
jbd2: fix null committed data return in undo_access
ext4: add "static" to ext4_seq_##name##_fops struct
ext4: fix an endianness bug in ext4_encrypted_follow_link()
ext4: fix an endianness bug in ext4_encrypted_zeroout()
jbd2: Fix unreclaimed pages after truncate in data=journal mode
ext4: Fix handling of extended tv_sec
Ted and Namjae have reported that truncated pages don't get timely
reclaimed after being truncated in data=journal mode. The following test
triggers the issue easily:
for (i = 0; i < 1000; i++) {
pwrite(fd, buf, 1024*1024, 0);
fsync(fd);
fsync(fd);
ftruncate(fd, 0);
}
The reason is that journal_unmap_buffer() finds that truncated buffers
are not journalled (jh->b_transaction == NULL), they are part of
checkpoint list of a transaction (jh->b_cp_transaction != NULL) and have
been already written out (!buffer_dirty(bh)). We clean such buffers but
we leave them in the checkpoint list. Since checkpoint transaction holds
a reference to the journal head, these buffers cannot be released until
the checkpoint transaction is cleaned up. And at that point we don't
call release_buffer_page() anymore so pages detached from mapping are
lingering in the system waiting for reclaim to find them and free them.
Fix the problem by removing buffers from transaction checkpoint lists
when journal_unmap_buffer() finds out they don't have to be there
anymore.
Reported-and-tested-by: Namjae Jeon <namjae.jeon@samsung.com>
Fixes: de1b794130
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Merge second patch-bomb from Andrew Morton:
- most of the rest of MM
- procfs
- lib/ updates
- printk updates
- bitops infrastructure tweaks
- checkpatch updates
- nilfs2 update
- signals
- various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
dma-debug, dma-mapping, ...
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits)
ipc,msg: drop dst nil validation in copy_msg
include/linux/zutil.h: fix usage example of zlib_adler32()
panic: release stale console lock to always get the logbuf printed out
dma-debug: check nents in dma_sync_sg*
dma-mapping: tidy up dma_parms default handling
pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
kexec: use file name as the output message prefix
fs, seqfile: always allow oom killer
seq_file: reuse string_escape_str()
fs/seq_file: use seq_* helpers in seq_hex_dump()
coredump: change zap_threads() and zap_process() to use for_each_thread()
coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
signals: kill block_all_signals() and unblock_all_signals()
nilfs2: fix gcc uninitialized-variable warnings in powerpc build
nilfs2: fix gcc unused-but-set-variable warnings
MAINTAINERS: nilfs2: add header file for tracing
nilfs2: add tracepoints for analyzing reading and writing metadata files
...
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts. They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve". __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".
Over time, callers had a requirement to not block when fallback options
were available. Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.
This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative. High priority users continue to use
__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim. __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.
This patch then converts a number of sites
o __GFP_ATOMIC is used by callers that are high priority and have memory
pools for those requests. GFP_ATOMIC uses this flag.
o Callers that have a limited mempool to guarantee forward progress clear
__GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
into this category where kswapd will still be woken but atomic reserves
are not used as there is a one-entry mempool to guarantee progress.
o Callers that are checking if they are non-blocking should use the
helper gfpflags_allow_blocking() where possible. This is because
checking for __GFP_WAIT as was done historically now can trigger false
positives. Some exceptions like dm-crypt.c exist where the code intent
is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
flag manipulations.
o Callers that built their own GFP flags instead of starting with GFP_KERNEL
and friends now also need to specify __GFP_KSWAPD_RECLAIM.
The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.
The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL. They may
now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a EXT4 filesystem utilizes JBD2 journaling and an error occurs, the
journaling will be aborted first and the error number will be recorded
into JBD2 superblock and, finally, the system will enter into the
panic state in "errors=panic" option. But, in the rare case, this
sequence is little twisted like the below figure and it will happen
that the system enters into panic state, which means the system reset
in mobile environment, before completion of recording an error in the
journal superblock. In this case, e2fsck cannot recognize that the
filesystem failure occurred in the previous run and the corruption
wouldn't be fixed.
Task A Task B
ext4_handle_error()
-> jbd2_journal_abort()
-> __journal_abort_soft()
-> __jbd2_journal_abort_hard()
| -> journal->j_flags |= JBD2_ABORT;
|
| __ext4_abort()
| -> jbd2_journal_abort()
| | -> __journal_abort_soft()
| | -> if (journal->j_flags & JBD2_ABORT)
| | return;
| -> panic()
|
-> jbd2_journal_update_sb_errno()
Tested-by: Hobin Woo <hobin.woo@samsung.com>
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Unlike comments and expectation of callers journal_clean_one_cp_list()
returned 1 not only if it freed the transaction but also if it freed
some buffers in the transaction. That could make
__jbd2_journal_clean_checkpoint_list() skip processing
t_checkpoint_io_list and continue with processing the next transaction.
This is mostly a cosmetic issue since the only result is we can
sometimes free less memory than we could. But it's still worth fixing.
Fix journal_clean_one_cp_list() to return 1 only if the transaction was
really freed.
Fixes: 50849db32a
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Create separate predicate functions to test/set/clear feature flags,
thereby replacing the wordy old macros. Furthermore, clean out the
places where we open-coded feature tests.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Instead of overloading EIO for CRC errors and corrupt structures,
return the same error codes that XFS returns for the same issues.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Change the journal's checksum functions to gate on whether or not the
crc32c driver is loaded, and gate the loading on the superblock bits.
This prevents a journal crash if someone loads a journal in no-csum
mode and then randomizes the superblock, thus flipping on the feature
bits.
Tested-By: Nikolay Borisov <kernel@kyup.com>
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently there is no limitation on number of reserved credits we can
ask for. If we ask for more reserved credits than 1/2 of maximum
transaction size, or if total number of credits exceeds the maximum
transaction size per operation (which is currently only possible with
the former) we will spin forever in start_this_handle().
Fix this by adding this limitation at the start of start_this_handle().
This patch also removes the credit limitation 1/2 of maximum transaction
size, since we really only want to limit the number of reserved credits.
There is not much point to limit the credits if there is still space in
the journal.
This accidentally also fixes the online resize, where due to the
limitation of the journal credits we're unable to grow file systems with
1k block size and size between 16M and 32M. It has been partially fixed
by 2c869b262a, but not entirely.
Thanks Jan Kara for helping me getting the correct fix.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Commit 6f6a6fda29 "jbd2: fix ocfs2 corrupt when updating journal
superblock fails" changed jbd2_cleanup_journal_tail() to return EIO
when the journal is aborted. That makes logic in
jbd2_log_do_checkpoint() bail out which is fine, except that
jbd2_journal_destroy() expects jbd2_log_do_checkpoint() to always make
a progress in cleaning the journal. Without it jbd2_journal_destroy()
just loops in an infinite loop.
Fix jbd2_journal_destroy() to cleanup journal checkpoint lists of
jbd2_log_do_checkpoint() fails with error.
Reported-by: Eryu Guan <guaneryu@gmail.com>
Tested-by: Eryu Guan <guaneryu@gmail.com>
Fixes: 6f6a6fda29
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When an error condition is detected, an error status should be recorded into
superblocks of EXT4 or JBD2. However, the write request is submitted now
without REQ_FUA flag, even in "barrier=1" mode, which is followed by
panic() function in "errors=panic" mode. On mobile devices which make
whole system reset as soon as kernel panic occurs, this write request
containing an error flag will disappear just from storage cache without
written to the physical cells. Therefore, when next start, even forever,
the error flag cannot be shown in both superblocks, and e2fsck cannot fix
the filesystem problems automatically, unless e2fsck is executed in
force checking mode.
[ Changed use test_opt(sb, BARRIER) of checking the journal flags -- TYT ]
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
It is often the case that we mark buffer as having dirty metadata when
the buffer is already in that state (frequent for bitmaps, inode table
blocks, superblock). Thus it is unnecessary to contend on grabbing
journal head reference and bh_state lock. Avoid that by checking whether
any modification to the buffer is needed before grabbing any locks or
references.
[ Note: this is a fixed version of commit 2143c1965a, which was
reverted in ebeaa8ddb3 due to a false positive triggering of an
assertion check. -- Ted ]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This reverts commit 2143c1965a.
This commit seems to be the cause of the following jbd2 assertion
failure:
------------[ cut here ]------------
kernel BUG at fs/jbd2/transaction.c:1325!
invalid opcode: 0000 [#1] SMP
Modules linked in: bnep bluetooth fuse ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 ...
CPU: 7 PID: 5509 Comm: gcc Not tainted 4.1.0-10944-g2a298679b411 #1
Hardware name: /DH87RL, BIOS RLH8710H.86A.0327.2014.0924.1645 09/24/2014
task: ffff8803bf866040 ti: ffff880308528000 task.ti: ffff880308528000
RIP: jbd2_journal_dirty_metadata+0x237/0x290
Call Trace:
__ext4_handle_dirty_metadata+0x43/0x1f0
ext4_handle_dirty_dirent_node+0xde/0x160
? jbd2_journal_get_write_access+0x36/0x50
ext4_delete_entry+0x112/0x160
? __ext4_journal_start_sb+0x52/0xb0
ext4_unlink+0xfa/0x260
vfs_unlink+0xec/0x190
do_unlinkat+0x24a/0x270
SyS_unlink+0x11/0x20
entry_SYSCALL_64_fastpath+0x12/0x6a
---[ end trace ae033ebde8d080b4 ]---
which is not easily reproducible (I've seen it just once, and then Ted
was able to reproduce it once). Revert it while Ted and Jan try to
figure out what is wrong.
Cc: Jan Kara <jack@suse.cz>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge second patchbomb from Andrew Morton:
- most of the rest of MM
- lots of misc things
- procfs updates
- printk feature work
- updates to get_maintainer, MAINTAINERS, checkpatch
- lib/ updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (96 commits)
exit,stats: /* obey this comment */
coredump: add __printf attribute to cn_*printf functions
coredump: use from_kuid/kgid when formatting corename
fs/reiserfs: remove unneeded cast
NILFS2: support NFSv2 export
fs/befs/btree.c: remove unneeded initializations
fs/minix: remove unneeded cast
init/do_mounts.c: add create_dev() failure log
kasan: remove duplicate definition of the macro KASAN_FREE_PAGE
fs/efs: femove unneeded cast
checkpatch: emit "NOTE: <types>" message only once after multiple files
checkpatch: emit an error when there's a diff in a changelog
checkpatch: validate MODULE_LICENSE content
checkpatch: add multi-line handling for PREFER_ETHER_ADDR_COPY
checkpatch: suggest using eth_zero_addr() and eth_broadcast_addr()
checkpatch: fix processing of MEMSET issues
checkpatch: suggest using ether_addr_equal*()
checkpatch: avoid NOT_UNIFIED_DIFF errors on cover-letter.patch files
checkpatch: remove local from codespell path
checkpatch: add --showfile to allow input via pipe to show filenames
...
In one case, we eliminate a local variable; in the other a strlen()
call and some .text.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is often the case that we mark buffer as having dirty metadata when
the buffer is already in that state (frequent for bitmaps, inode table
blocks, superblock). Thus it is unnecessary to contend on grabbing
journal head reference and bh_state lock. Avoid that by checking whether
any modification to the buffer is needed before grabbing any locks or
references.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
insert_revoke_hash does an open coded endless allocation loop if
journal_oom_retry is true. It doesn't implement any allocation fallback
strategy between the retries, though. The memory allocator doesn't know
about the never fail requirement so it cannot potentially help to move
on with the allocation (e.g. use memory reserves).
Get rid of the retry loop and use __GFP_NOFAIL instead. We will lose the
debugging message but I am not sure it is anyhow helpful.
Do the same for journal_alloc_journal_head which is doing a similar
thing.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If updating journal superblock fails after journal data has been
flushed, the error is omitted and this will mislead the caller as a
normal case. In ocfs2, the checkpoint will be treated successfully
and the other node can get the lock to update. Since the sb_start is
still pointing to the old log block, it will rewrite the journal data
during journal recovery by the other node. Thus the new updates will
be overwritten and ocfs2 corrupts. So in above case we have to return
the error, and ocfs2_commit_cache will take care of the error and
prevent the other node to do update first. And only after recovering
journal it can do the new updates.
The issue discussion mail can be found at:
https://oss.oracle.com/pipermail/ocfs2-devel/2015-June/010856.htmlhttp://comments.gmane.org/gmane.comp.file-systems.ext4/48841
[ Fixed bug in patch which allowed a non-negative error return from
jbd2_cleanup_journal_tail() to leak out of jbd2_fjournal_flush(); this
was causing xfstests ext4/306 to fail. -- Ted ]
Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Tested-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: stable@vger.kernel.org
jbd2_journal_get_write_access() and jbd2_journal_get_create_access() are
frequently called for buffers that are already part of the running
transaction - most frequently it is the case for bitmaps, inode table
blocks, and superblock. Since in such cases we have nothing to do, it is
unfortunate we still grab reference to journal head, lock the bh, lock
bh_state only to find out there's nothing to do.
Improving this is a bit subtle though since until we find out journal
head is attached to the running transaction, it can disappear from under
us because checkpointing / commit decided it's no longer needed. We deal
with this by protecting journal_head slab with RCU. We still have to be
careful about journal head being freed & reallocated within slab and
about exposing journal head in consistent state (in particular
b_modified and b_frozen_data must be in correct state before we allow
user to touch the buffer).
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Check for the simple case of unjournaled buffer first, handle it and
bail out. This allows us to remove one if and unindent the difficult case
by one tab. The result is easier to read.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We were acquiring bh_state_lock when allocation of buffer failed in
do_get_write_access() only to be able to jump to a label that releases
the lock and does all other checks that don't make sense for this error
path. Just jump into the right label instead.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
needs_copy is set only in one place in do_get_write_access(), just move
the frozen buffer copying into that place and factor it out to a
separate function to make do_get_write_access() slightly more readable.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This basically reverts 47def82672 (jbd2: Remove __GFP_NOFAIL from jbd2
layer). The deprecation of __GFP_NOFAIL was a bad choice because it led
to open coding the endless loop around the allocator rather than
removing the dependency on the non failing allocation. So the
deprecation was a clear failure and the reality tells us that
__GFP_NOFAIL is not even close to go away.
It is still true that __GFP_NOFAIL allocations are generally discouraged
and new uses should be evaluated and an alternative (pre-allocations or
reservations) should be considered but it doesn't make any sense to lie
the allocator about the requirements. Allocator can take steps to help
making a progress if it knows the requirements.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Acked-by: David Rientjes <rientjes@google.com>
The journal revoke block recovery code does not check r_count for
sanity, which means that an evil value of r_count could result in
the kernel reading off the end of the revoke table and into whatever
garbage lies beyond. This could crash the kernel, so fix that.
However, in testing this fix, I discovered that the code to write
out the revoke tables also was not correctly checking to see if the
block was full -- the current offset check is fine so long as the
revoke table space size is a multiple of the record size, but this
is not true when either journal_csum_v[23] are set.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Currently when journal restart fails, we'll have the h_transaction of
the handle set to NULL to indicate that the handle has been effectively
aborted. We handle this situation quietly in the jbd2_journal_stop() and just
free the handle and exit because everything else has been done before we
attempted (and failed) to restart the journal.
Unfortunately there are a number of problems with that approach
introduced with commit
41a5b91319 "jbd2: invalidate handle if jbd2_journal_restart()
fails"
First of all in ext4 jbd2_journal_stop() will be called through
__ext4_journal_stop() where we would try to get a hold of the superblock
by dereferencing h_transaction which in this case would lead to NULL
pointer dereference and crash.
In addition we're going to free the handle regardless of the refcount
which is bad as well, because others up the call chain will still
reference the handle so we might potentially reference already freed
memory.
Moreover it's expected that we'll get aborted handle as well as detached
handle in some of the journalling function as the error propagates up
the stack, so it's unnecessary to call WARN_ON every time we get
detached handle.
And finally we might leak some memory by forgetting to free reserved
handle in jbd2_journal_stop() in the case where handle was detached from
the transaction (h_transaction is NULL).
Fix the NULL pointer dereference in __ext4_journal_stop() by just
calling jbd2_journal_stop() quietly as suggested by Jan Kara. Also fix
the potential memory leak in jbd2_journal_stop() and use proper
handle refcounting before we attempt to free it to avoid use-after-free
issues.
And finally remove all WARN_ON(!transaction) from the code so that we do
not get random traces when something goes wrong because when journal
restart fails we will get to some of those functions.
Cc: stable@vger.kernel.org
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
We should complain in dmesg when journal recovery fails on account of
the descriptor block being corrupt, so that the diagnostic data can
be recovered.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
fixes, which should improve CPU utilization and potential soft lockups
under heavy memory pressure, and Eric Whitney's bigalloc fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJUiRUwAAoJENNvdpvBGATwltQP/3sjHtFw+RUvKgQ8vX9M2THk
4b9j0ja0mrD3ObTXUxdDuOh1q09MsfSUiOYK6KZOav3nO/dRODqZnWgXz/zJt3LC
R97s4velgzZi3F2ijnLiCo5RVZahN9xs8bUHZ85orMIr5wogwGdaUpnoqZSg0Ehr
PIFnTNORyNXBwEm3XPjUmENTdyq9FZ8DsS6ACFzgFi79QTSyJFEM4LAl2XaqwMGV
fVhNwnOGIyT8lHZAtDcobkaC86NjakmpW2Ip3p9/UEQtynh16UeVXKEO3K7CcQ+L
YJRDNnSIlGpR1OJp+v6QJPUd8q4fc/8JW9AxxsLak0eqkszuB+MxoQXOCFV5AWaf
jrs4TV3y0hCuB4OwuYUpnfcU1o+O7p39MqXMv8SA1ZBPbijN/LQSMErFtXj2oih6
3gJHUWLwELGeR+d9JlI29zxhOeOIotX255UBgj2oasQ0X3BW3qAgQ4LmP3QY90Pm
BUmxiMoIWB9N3kU4XQGf+Kyy8JeMLJj0frHDxI3XLz+B+IlWCCkBH6y3AD/a13kS
HHMMLOwHGEs0lYEKsm89dkcij5GuKd8eKT8Q0+CvKD9Z6HPdYvQxoazmF87Q6j/7
ZmshaVxtWaLpNbDaXVg+IgZifJAN0+mVzVHRhY9TSjx8k9qLdSgSEqYWjkSjx9Ij
nNB2zVrHZDMvZ7MCZy85
=ZrTc
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Lots of bugs fixes, including Zheng and Jan's extent status shrinker
fixes, which should improve CPU utilization and potential soft lockups
under heavy memory pressure, and Eric Whitney's bigalloc fixes"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (26 commits)
ext4: ext4_da_convert_inline_data_to_extent drop locked page after error
ext4: fix suboptimal seek_{data,hole} extents traversial
ext4: ext4_inline_data_fiemap should respect callers argument
ext4: prevent fsreentrance deadlock for inline_data
ext4: forbid journal_async_commit in data=ordered mode
jbd2: remove unnecessary NULL check before iput()
ext4: Remove an unnecessary check for NULL before iput()
ext4: remove unneeded code in ext4_unlink
ext4: don't count external journal blocks as overhead
ext4: remove never taken branch from ext4_ext_shift_path_extents()
ext4: create nojournal_checksum mount option
ext4: update comments regarding ext4_delete_inode()
ext4: cleanup GFP flags inside resize path
ext4: introduce aging to extent status tree
ext4: cleanup flag definitions for extent status tree
ext4: limit number of scanned extents in status tree shrinker
ext4: move handling of list of shrinkable inodes into extent status code
ext4: change LRU to round-robin in extent status tree shrinker
ext4: cache extent hole in extent status tree for ext4_da_map_blocks()
ext4: fix block reservation for bigalloc filesystems
...
When we're enabling journal features, we cannot use the predicate
jbd2_journal_has_csum_v2or3() because we haven't yet set the sb
feature flag fields! Moreover, we just finished loading the shash
driver, so the test is unnecessary; calculate the seed always.
Without this patch, we fail to initialize the checksum seed the first
time we turn on journal_checksum, which means that all journal blocks
written during that first mount are corrupt. Transactions written
after the second mount will be fine, since the feature flag will be
set in the journal superblock. xfstests generic/{034,321,322} are the
regression tests.
(This is important for 3.18.)
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.coM>
Reported-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The old hash function didn't work well for 64-bit block numbers, and
used undefined (negative) shift right behavior. Use the generic
64-bit hash function instead.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Andrey Ryabinin <a.ryabinin@samsung.com>
__jbd2_journal_clean_checkpoint_list() returns number of buffers it
freed but noone was using the value so just stop doing that. This
also allows for simplifying the calling convention for
journal_clean_once_cp_list().
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Yuanhan has reported that when he is running fsync(2) heavy workload
creating new files over ramdisk, significant amount of time is spent in
__jbd2_journal_clean_checkpoint_list() trying to clean old transactions
(but they cannot be cleaned up because flusher hasn't yet checkpointed
those buffers). The workload can be generated by:
fs_mark -d /fs/ram0/1 -D 2 -N 2560 -n 1000000 -L 1 -S 1 -s 4096
Reduce the amount of scanning by stopping to scan the transaction list
once we find a transaction that cannot be checkpointed. Note that this
way of cleaning is still enough to keep freeing space in the journal
after fully checkpointed transactions.
Reported-and-tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If EIO happens after we have dropped j_state_lock, we won't notice
that the journal has been aborted. So it is reasonable to move this
check after we have grabbed the j_checkpoint_mutex and re-grabbed the
j_state_lock. This patch helps to prevent false positive complain
after EIO.
#DMESG:
__jbd2_log_wait_for_space: needed 8448 blocks and only had 8386 space available
__jbd2_log_wait_for_space: no way to get more journal space in ram1-8
------------[ cut here ]------------
WARNING: CPU: 15 PID: 6739 at fs/jbd2/checkpoint.c:168 __jbd2_log_wait_for_space+0x188/0x200()
Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod
CPU: 15 PID: 6739 Comm: fsstress Tainted: G W 3.17.0-rc2-00429-g684de57 #139
Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011
00000000000000a8 ffff88077aaab878 ffffffff815c1a8c 00000000000000a8
0000000000000000 ffff88077aaab8b8 ffffffff8106ce8c ffff88077aaab898
ffff8807c57e6000 ffff8807c57e6028 0000000000002100 ffff8807c57e62f0
Call Trace:
[<ffffffff815c1a8c>] dump_stack+0x51/0x6d
[<ffffffff8106ce8c>] warn_slowpath_common+0x8c/0xc0
[<ffffffff8106ceda>] warn_slowpath_null+0x1a/0x20
[<ffffffff812419f8>] __jbd2_log_wait_for_space+0x188/0x200
[<ffffffff8123be9a>] start_this_handle+0x4da/0x7b0
[<ffffffff810990e5>] ? local_clock+0x25/0x30
[<ffffffff810aba87>] ? lockdep_init_map+0xe7/0x180
[<ffffffff8123c5bc>] jbd2__journal_start+0xdc/0x1d0
[<ffffffff811f2414>] ? __ext4_new_inode+0x7f4/0x1330
[<ffffffff81222a38>] __ext4_journal_start_sb+0xf8/0x110
[<ffffffff811f2414>] __ext4_new_inode+0x7f4/0x1330
[<ffffffff810ac359>] ? lock_release_holdtime+0x29/0x190
[<ffffffff812025bb>] ext4_create+0x8b/0x150
[<ffffffff8117fe3b>] vfs_create+0x7b/0xb0
[<ffffffff8118097b>] do_last+0x7db/0xcf0
[<ffffffff8117e31d>] ? inode_permission+0x4d/0x50
[<ffffffff811845d2>] path_openat+0x242/0x590
[<ffffffff81191a76>] ? __alloc_fd+0x36/0x140
[<ffffffff81184a6a>] do_filp_open+0x4a/0xb0
[<ffffffff81191b61>] ? __alloc_fd+0x121/0x140
[<ffffffff81172f20>] do_sys_open+0x170/0x220
[<ffffffff8117300e>] SyS_open+0x1e/0x20
[<ffffffff811715d6>] SyS_creat+0x16/0x20
[<ffffffff815c7e12>] system_call_fastpath+0x16/0x1b
---[ end trace cd71c831f82059db ]---
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Free the buffer head if the journal descriptor block fails checksum
verification.
This is the jbd2 port of the e2fsprogs patch "e2fsck: free bh on csum
verify error in do_one_pass".
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Cc: stable@vger.kernel.org