Commit Graph

798554 Commits

Author SHA1 Message Date
Filipe Manana 827aa18e7b Btrfs: use nofs context when initializing security xattrs to avoid deadlock
When initializing the security xattrs, we are holding a transaction handle
therefore we need to use a GFP_NOFS context in order to avoid a deadlock
with reclaim in case it's triggered.

Fixes: 39a27ec100 ("btrfs: use GFP_KERNEL for xattr and acl allocations")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:49 +01:00
Josef Bacik 0568e82dbe btrfs: run delayed items before dropping the snapshot
With my delayed refs patches in place we started seeing a large amount
of aborts in __btrfs_free_extent:

 BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 35964  owner 1 offset 0
 Call Trace:
  ? btrfs_merge_delayed_refs+0xaf/0x340
  __btrfs_run_delayed_refs+0x6ea/0xfc0
  ? btrfs_set_path_blocking+0x31/0x60
  btrfs_run_delayed_refs+0xeb/0x180
  btrfs_commit_transaction+0x179/0x7f0
  ? btrfs_check_space_for_delayed_refs+0x30/0x50
  ? should_end_transaction.isra.19+0xe/0x40
  btrfs_drop_snapshot+0x41c/0x7c0
  btrfs_clean_one_deleted_snapshot+0xb5/0xd0
  cleaner_kthread+0xf6/0x120
  kthread+0xf8/0x130
  ? btree_invalidatepage+0x90/0x90
  ? kthread_bind+0x10/0x10
  ret_from_fork+0x35/0x40

This was because btrfs_drop_snapshot depends on the root not being
modified while it's dropping the snapshot.  It will unlock the root node
(and really every node) as it walks down the tree, only to re-lock it
when it needs to do something.  This is a problem because if we modify
the tree we could cow a block in our path, which frees our reference to
that block.  Then once we get back to that shared block we'll free our
reference to it again, and get ENOENT when trying to lookup our extent
reference to that block in __btrfs_free_extent.

This is ultimately happening because we have delayed items left to be
processed for our deleted snapshot _after_ all of the inodes are closed
for the snapshot.  We only run the delayed inode item if we're deleting
the inode, and even then we do not run the delayed insertions or delayed
removals.  These can be run at any point after our final inode does its
last iput, which is what triggers the snapshot deletion.  We can end up
with the snapshot deletion happening and then have the delayed items run
on that file system, resulting in the above problem.

This problem has existed forever, however my patches made it much easier
to hit as I wake up the cleaner much more often to deal with delayed
iputs, which made us more likely to start the snapshot dropping work
before the transaction commits, which is when the delayed items would
generally be run.  Before, generally speaking, we would run the delayed
items, commit the transaction, and wakeup the cleaner thread to start
deleting snapshots, which means we were less likely to hit this problem.
You could still hit it if you had multiple snapshots to be deleted and
ended up with lots of delayed items, but it was definitely harder.

Fix for now by simply running all the delayed items before starting to
drop the snapshot.  We could make this smarter in the future by making
the delayed items per-root, and then simply drop any delayed items for
roots that we are going to delete.  But for now just a quick and easy
solution is the safest.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:49 +01:00
Josef Bacik 83354f0772 btrfs: catch cow on deleting snapshots
When debugging some weird extent reference bug I suspected that we were
changing a snapshot while we were deleting it, which could explain my
bug.  This was indeed what was happening, and this patch helped me
verify my theory.  It is never correct to modify the snapshot once it's
being deleted, so mark the root when we are deleting it and make sure we
complain about it when it happens.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:48 +01:00
Qu Wenruo 01e0da4885 btrfs: extent-tree: cleanup one-shot usage of @blocksize in do_walk_down
@blocksize variable in do_walk_down() is only used once, really no need
to declare it.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:48 +01:00
Filipe Manana 7c3c7cb99c Btrfs: scrub, move setup of nofs contexts higher in the stack
Since scrub workers only do memory allocation with GFP_KERNEL when they
need to perform repair, we can move the recent setup of the nofs context
up to scrub_handle_errored_block() instead of setting it up down the call
chain at insert_full_stripe_lock() and scrub_add_page_to_wr_bio(),
removing some duplicate code and comment. So the only paths for which a
scrub worker can do memory allocations using GFP_KERNEL are the following:

 scrub_bio_end_io_worker()
   scrub_block_complete()
     scrub_handle_errored_block()
       lock_full_stripe()
         insert_full_stripe_lock()
           -> kmalloc with GFP_KERNEL

  scrub_bio_end_io_worker()
    scrub_block_complete()
      scrub_handle_errored_block()
        scrub_write_page_to_dev_replace()
          scrub_add_page_to_wr_bio()
            -> kzalloc with GFP_KERNEL

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:48 +01:00
David Sterba 0e94c4f45d btrfs: scrub: move scrub_setup_ctx allocation out of device_list_mutex
The scrub context is allocated with GFP_KERNEL and called from
btrfs_scrub_dev under the fs_info::device_list_mutex. This is not safe
regarding reclaim that could try to flush filesystem data in order to
get the memory. And the device_list_mutex is held during superblock
commit, so this would cause a lockup.

Move the alocation and initialization before any changes that require
the mutex.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:48 +01:00
David Sterba 92f7ba434f btrfs: scrub: pass fs_info to scrub_setup_ctx
We can pass fs_info directly as this is the only member of btrfs_device
that's bing used inside scrub_setup_ctx.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:48 +01:00
Josef Bacik 28bad21257 btrfs: fix truncate throttling
We have a bunch of magic to make sure we're throttling delayed refs when
truncating a file.  Now that we have a delayed refs rsv and a mechanism
for refilling that reserve simply use that instead of all of this magic.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:47 +01:00
Josef Bacik db2462a6ad btrfs: don't run delayed refs in the end transaction logic
Over the years we have built up a lot of infrastructure to keep delayed
refs in check, mostly by running them at btrfs_end_transaction() time.
We have a lot of different maths we do to figure out how much, if we
should do it inline or async, etc.  This existed because we had no
feedback mechanism to force the flushing of delayed refs when they
became a problem.  However with the enospc flushing infrastructure in
place for flushing delayed refs when they put too much pressure on the
enospc system we have this problem solved.  Rip out all of this code as
it is no longer needed.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:47 +01:00
Josef Bacik 64403612b7 btrfs: rework btrfs_check_space_for_delayed_refs
Now with the delayed_refs_rsv we can now know exactly how much pending
delayed refs space we need.  This means we can drastically simplify
btrfs_check_space_for_delayed_refs by simply checking how much space we
have reserved for the global rsv (which acts as a spill over buffer) and
the delayed refs rsv.  If our total size is beyond that amount then we
know it's time to commit the transaction and stop any more delayed refs
from being generated.

With the introduction of dealyed_refs_rsv infrastructure, namely
btrfs_update_delayed_refs_rsv we now know exactly how much pending
delayed refs space is required.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:47 +01:00
Josef Bacik 413df7252d btrfs: add new flushing states for the delayed refs rsv
A nice thing we gain with the delayed refs rsv is the ability to flush
the delayed refs on demand to deal with enospc pressure.  Add states to
flush delayed refs on demand, and this will allow us to remove a lot of
ad-hoc work around checking to see if we should commit the transaction
to run our delayed refs.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:47 +01:00
Josef Bacik 4c8edbc75c btrfs: update may_commit_transaction to use the delayed refs rsv
Any space used in the delayed_refs_rsv will be freed up by a transaction
commit, so instead of just counting the pinned space we also need to
account for any space in the delayed_refs_rsv when deciding if it will
make a different to commit the transaction to satisfy our space
reservation.  If we have enough bytes to satisfy our reservation ticket
then we are good to go, otherwise subtract out what space we would gain
back by committing the transaction and compare that against the pinned
space to make our decision.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:47 +01:00
Josef Bacik ba2c4d4e3b btrfs: introduce delayed_refs_rsv
Traditionally we've had voodoo in btrfs to account for the space that
delayed refs may take up by having a global_block_rsv.  This works most
of the time, except when it doesn't.  We've had issues reported and seen
in production where sometimes the global reserve is exhausted during
transaction commit before we can run all of our delayed refs, resulting
in an aborted transaction.  Because of this voodoo we have equally
dubious flushing semantics around throttling delayed refs which we often
get wrong.

So instead give them their own block_rsv.  This way we can always know
exactly how much outstanding space we need for delayed refs.  This
allows us to make sure we are constantly filling that reservation up
with space, and allows us to put more precise pressure on the enospc
system.  Instead of doing math to see if its a good time to throttle,
the normal enospc code will be invoked if we have a lot of delayed refs
pending, and they will be run via the normal flushing mechanism.

For now the delayed_refs_rsv will hold the reservations for the delayed
refs, the block group updates, and deleting csums.  We could have a
separate rsv for the block group updates, but the csum deletion stuff is
still handled via the delayed_refs so that will stay there.

Historical background:

The global reserve has grown to cover everything we don't reserve space
explicitly for, and we've grown a lot of weird ad-hoc heuristics to know
if we're running short on space and when it's time to force a commit.  A
failure rate of 20-40 file systems when we run hundreds of thousands of
them isn't super high, but cleaning up this code will make things less
ugly and more predictible.

Thus the delayed refs rsv.  We always know how many delayed refs we have
outstanding, and although running them generates more we can use the
global reserve for that spill over, which fits better into it's desired
use than a full blown reservation.  This first approach is to simply
take how many times we're reserving space for and multiply that by 2 in
order to save enough space for the delayed refs that could be generated.
This is a niave approach and will probably evolve, but for now it works.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com> # high-level review
[ added background notes from the cover letter ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:46 +01:00
Josef Bacik 158ffa364b btrfs: only track ref_heads in delayed_ref_updates
We use this number to figure out how many delayed refs to run, but
__btrfs_run_delayed_refs really only checks every time we need a new
delayed ref head, so we always run at least one ref head completely no
matter what the number of items on it.  Fix the accounting to only be
adjusted when we add/remove a ref head.

In addition to using this number to limit the number of delayed refs
run, a future patch is also going to use it to calculate the amount of
space required for delayed refs space reservation.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:46 +01:00
Josef Bacik bedc661760 btrfs: cleanup extent_op handling
The cleanup_extent_op function actually would run the extent_op if it
needed running, which made the name sort of a misnomer.  Change it to
run_and_cleanup_extent_op, and move the actual cleanup work to
cleanup_extent_op so it can be used by check_ref_cleanup() in order to
unify the extent op handling.

Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:46 +01:00
Josef Bacik 07c47775f4 btrfs: add cleanup_ref_head_accounting helper
We were missing some quota cleanups in check_ref_cleanup, so break the
ref head accounting cleanup into a helper and call that from both
check_ref_cleanup and cleanup_ref_head.  This will hopefully ensure that
we don't screw up accounting in the future for other things that we add.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:46 +01:00
Josef Bacik d7baffdaf9 btrfs: add btrfs_delete_ref_head helper
We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
into a helper and cleanup the calling functions.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:46 +01:00
Johannes Thumshirn fdb1e12180 btrfs: use PAGE_ALIGNED instead of open-coding it
When using a 'var & (PAGE_SIZE - 1)' construct one is checking for a page
alignment and thus should use the PAGE_ALIGNED() macro instead of
open-coding it.

Convert all open-coded occurrences of PAGE_ALIGNED().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:45 +01:00
Johannes Thumshirn 7073017aeb btrfs: use offset_in_page instead of open-coding it
Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can denote an
offset into a page.

So replace them by the offset_in_page() macro instead of open-coding it if
they're not used as an alignment check.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:45 +01:00
David Sterba cb5583dd52 btrfs: dev-replace: open code trivial locking helpers
The dev-replace locking functions are now trivial wrappers around rw
semaphore that can be used directly everywhere. No functional change.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:45 +01:00
David Sterba 53176dde0a btrfs: dev-replace: remove custom read/write blocking scheme
After the rw semaphore has been added, the custom blocking using
::blocking_readers and ::read_lock_wq is redundant.

The blocking logic in __btrfs_map_block is replaced by extending the
time the semaphore is held, that has the same blocking effect on writes
as the previous custom scheme that waited until ::blocking_readers was
zero.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:45 +01:00
David Sterba 129827e300 btrfs: dev-replace: swich locking to rw semaphore
This is the first part of removing the custom locking and waiting scheme
used for device replace. It was probably copied from extent buffer
locking, but there's nothing that would require more than is provided by
the common locking primitives.

The rw spinlock protects waiting tasks counter in case of incompatible
locks and the waitqueue. Same as rw semaphore.

This patch only switches the locking primitive, for better
bisectability.  There should be no functional change other than the
overhead of the locking and potential sleeping instead of spinning when
the lock is contended.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:44 +01:00
David Sterba ceb21a8db4 btrfs: reada: reorder dev-replace locks before radix tree preload
The device-replace read lock is going to use rw semaphore in followup
commits. The semaphore might sleep which is not possible in the radix
tree preload section. The lock nesting is now:

* device replace
  * radix tree preload
    * readahead spinlock

Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:44 +01:00
Nikolay Borisov d1051d6ebf btrfs: Fix error handling in btrfs_cleanup_ordered_extents
Running btrfs/124 in a loop hung up on me sporadically with the
following call trace:

	btrfs           D    0  5760   5324 0x00000000
	Call Trace:
	 ? __schedule+0x243/0x800
	 schedule+0x33/0x90
	 btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
	 ? wait_woken+0xa0/0xa0
	 btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
	 btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
	 btrfs_relocate_chunk+0x49/0x100 [btrfs]
	 btrfs_balance+0xbeb/0x1740 [btrfs]
	 btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
	 btrfs_ioctl+0x1691/0x3110 [btrfs]
	 ? lockdep_hardirqs_on+0xed/0x180
	 ? __handle_mm_fault+0x8e7/0xfb0
	 ? _raw_spin_unlock+0x24/0x30
	 ? __handle_mm_fault+0x8e7/0xfb0
	 ? do_vfs_ioctl+0xa5/0x6e0
	 ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
	 do_vfs_ioctl+0xa5/0x6e0
	 ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
	 ksys_ioctl+0x3a/0x70
	 __x64_sys_ioctl+0x16/0x20
	 do_syscall_64+0x60/0x1b0
	 entry_SYSCALL_64_after_hwframe+0x49/0xbe

This happens because during page writeback it's valid for
writepage_delalloc to instantiate a delalloc range which doesn't belong
to the page currently being written back.

The reason this case is valid is due to find_lock_delalloc_range
returning any available range after the passed delalloc_start and
ignoring whether the page under writeback is within that range.

In turn ordered extents (OE) are always created for the returned range
from find_lock_delalloc_range. If, however, a failure occurs while OE
are being created then the clean up code in btrfs_cleanup_ordered_extents
will be called.

Unfortunately the code in btrfs_cleanup_ordered_extents doesn't consider
the case of such 'foreign' range being processed and instead it always
assumes that the range OE are created for belongs to the page. This
leads to the first page of such foregin range to not be cleaned up since
it's deliberately missed and skipped by the current cleaning up code.

Fix this by correctly checking whether the current page belongs to the
range being instantiated and if so adjsut the range parameters passed
for cleaning up. If it doesn't, then just clean the whole OE range
directly.

Fixes: 524272607e ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:44 +01:00
Lu Fengqi 3522e90301 btrfs: remove always true if branch in find_delalloc_range
The @found is always false when it comes to the if branch. Besides, the
bool type is more suitable for @found. Change the return value of the
function and its caller to bool as well.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:44 +01:00
Lu Fengqi 27a7ff554e btrfs: skip file_extent generation check for free_space_inode in run_delalloc_nocow
The test case btrfs/001 with inode_cache mount option will encounter the
following warning:

  WARNING: CPU: 1 PID: 23700 at fs/btrfs/inode.c:956 cow_file_range.isra.19+0x32b/0x430 [btrfs]
  CPU: 1 PID: 23700 Comm: btrfs Kdump: loaded Tainted: G        W  O      4.20.0-rc4-custom+ #30
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:cow_file_range.isra.19+0x32b/0x430 [btrfs]
  Call Trace:
   ? free_extent_buffer+0x46/0x90 [btrfs]
   run_delalloc_nocow+0x455/0x900 [btrfs]
   btrfs_run_delalloc_range+0x1a7/0x360 [btrfs]
   writepage_delalloc+0xf9/0x150 [btrfs]
   __extent_writepage+0x125/0x3e0 [btrfs]
   extent_write_cache_pages+0x1b6/0x3e0 [btrfs]
   ? __wake_up_common_lock+0x63/0xc0
   extent_writepages+0x50/0x80 [btrfs]
   do_writepages+0x41/0xd0
   ? __filemap_fdatawrite_range+0x9e/0xf0
   __filemap_fdatawrite_range+0xbe/0xf0
   btrfs_fdatawrite_range+0x1b/0x50 [btrfs]
   __btrfs_write_out_cache+0x42c/0x480 [btrfs]
   btrfs_write_out_ino_cache+0x84/0xd0 [btrfs]
   btrfs_save_ino_cache+0x551/0x660 [btrfs]
   commit_fs_roots+0xc5/0x190 [btrfs]
   btrfs_commit_transaction+0x2bf/0x8d0 [btrfs]
   btrfs_mksubvol+0x48d/0x4d0 [btrfs]
   btrfs_ioctl_snap_create_transid+0x170/0x180 [btrfs]
   btrfs_ioctl_snap_create_v2+0x124/0x180 [btrfs]
   btrfs_ioctl+0x123f/0x3030 [btrfs]

The file extent generation of the free space inode is equal to the last
snapshot of the file root, so the inode will be passed to cow_file_rage.
But the inode was created and its extents were preallocated in
btrfs_save_ino_cache, there are no cow copies on disk.

The preallocated extent is not yet in the extent tree, and
btrfs_cross_ref_exist will ignore the -ENOENT returned by
check_committed_ref, so we can directly write the inode to the disk.

Fixes: 78d4295b1e ("btrfs: lift some btrfs_cross_ref_exist checks in nocow path")
CC: stable@vger.kernel.org # 4.18+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:44 +01:00
Filipe Manana 41bd606769 Btrfs: fix fsync of files with multiple hard links in new directories
The log tree has a long standing problem that when a file is fsync'ed we
only check for new ancestors, created in the current transaction, by
following only the hard link for which the fsync was issued. We follow the
ancestors using the VFS' dget_parent() API. This means that if we create a
new link for a file in a directory that is new (or in an any other new
ancestor directory) and then fsync the file using an old hard link, we end
up not logging the new ancestor, and on log replay that new hard link and
ancestor do not exist. In some cases, involving renames, the file will not
exist at all.

Example:

  mkfs.btrfs -f /dev/sdb
  mount /dev/sdb /mnt

  mkdir /mnt/A
  touch /mnt/foo
  ln /mnt/foo /mnt/A/bar
  xfs_io -c fsync /mnt/foo

  <power failure>

In this example after log replay only the hard link named 'foo' exists
and directory A does not exist, which is unexpected. In other major linux
filesystems, such as ext4, xfs and f2fs for example, both hard links exist
and so does directory A after mounting again the filesystem.

Checking if any new ancestors are new and need to be logged was added in
2009 by commit 12fcfd22fe ("Btrfs: tree logging unlink/rename fixes"),
however only for the ancestors of the hard link (dentry) for which the
fsync was issued, instead of checking for all ancestors for all of the
inode's hard links.

So fix this by tracking the id of the last transaction where a hard link
was created for an inode and then on fsync fallback to a full transaction
commit when an inode has more than one hard link and at least one new hard
link was created in the current transaction. This is the simplest solution
since this is not a common use case (adding frequently hard links for
which there's an ancestor created in the current transaction and then
fsync the file). In case it ever becomes a common use case, a solution
that consists of iterating the fs/subvol btree for each hard link and
check if any ancestor is new, could be implemented.

This solves many unexpected scenarios reported by Jayashree Mohan and
Vijay Chidambaram, and for which there is a new test case for fstests
under review.

Fixes: 12fcfd22fe ("Btrfs: tree logging unlink/rename fixes")
CC: stable@vger.kernel.org # 4.4+
Reported-by: Vijay Chidambaram <vvijay03@gmail.com>
Reported-by: Jayashree Mohan <jayashree2912@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:43 +01:00
David Sterba bbe339cc32 btrfs: drop extra enum initialization where using defaults
The first auto-assigned value to enum is 0, we can use that and not
initialize all members where the auto-increment does the same. This is
used for values that are not part of on-disk format.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:43 +01:00
David Sterba 5b840301ac btrfs: switch BTRFS_ORDERED_* to enums
We can use simple enum for values that are not part of on-disk format:
ordered extent flags.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:43 +01:00
David Sterba 50b5b6020f btrfs: switch EXTENT_FLAG_* to enums
We can use simple enum for values that are not part of on-disk format:
extent map flags.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:43 +01:00
David Sterba 80cb38362d btrfs: switch EXTENT_BUFFER_* to enums
We can use simple enum for values that are not part of on-disk format:
extent buffer flags;

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:43 +01:00
David Sterba 61fa90c16b btrfs: switch BTRFS_ROOT_* to enums
We can use simple enum for values that are not part of on-disk format:
root tree flags.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:42 +01:00
David Sterba eb1a524c95 btrfs: switch BTRFS_FS_* to enums
We can use simple enum for values that are not part of on-disk format:
internal filesystem states.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:42 +01:00
David Sterba 688a75b9a3 btrfs: switch BTRFS_BLOCK_RSV_* to enums
We can use simple enum for values that are not part of on-disk format:
block reserve types.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:42 +01:00
David Sterba b00146b5d5 btrfs: switch BTRFS_FS_STATE_* to enums
We can use simple enum for values that are not part of on-disk format:
global filesystem states.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:42 +01:00
Nikolay Borisov da12fe5414 btrfs: Refactor btrfs_merge_bio_hook
This function really checks whether adding more data to the bio will
straddle a stripe/chunk. So first let's give it a more appropraite name
- btrfs_bio_fits_in_stripe. Secondly, the offset parameter was never
used to just remove it. Thirdly, pages are submitted to either btree or
data inodes so it's guaranteed that tree->ops is set so replace the
check with an ASSERT. Finally, document the parameters of the function.
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:42 +01:00
Lu Fengqi 2ab4fd3135 btrfs: cleanup the useless DEFINE_WAIT in cleanup_transaction
When it was introduced in commit f094ac32ab ("Btrfs: fix NULL pointer
after aborting a transaction"), it was not used.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:42 +01:00
Johannes Thumshirn d2e174d5d3 btrfs: document extent mapping assumptions in checksum
Document why map_private_extent_buffer() cannot return '1' (i.e. the map
spans two pages) for the csum_tree_block() case.

The current algorithm for detecting a page boundary crossing in
map_private_extent_buffer() will return a '1' *IFF* the extent buffer's
offset in the page + the offset passed in by csum_tree_block() and the
minimal length passed in by csum_tree_block() - 1 are bigger than
PAGE_SIZE.

We always pass BTRFS_CSUM_SIZE (32) as offset and a minimal length of 32
and the current extent buffer allocator always guarantees page aligned
extends, so the above condition can't be true.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:41 +01:00
Johannes Thumshirn cc2c39d605 btrfs: don't initialize 'offset' in map_private_extent_buffer()
In map_private_extent_buffer() the 'offset' variable is initialized to a
page aligned version of the 'start' parameter.

But later on it is overwritten with either the offset from the extent
buffer's start or 0.

So get rid of the initial initialization.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:41 +01:00
Filipe Manana a5fb114291 Btrfs: fix deadlock with memory reclaim during scrub
When a transaction commit starts, it attempts to pause scrub and it blocks
until the scrub is paused. So while the transaction is blocked waiting for
scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub,
otherwise we risk getting into a deadlock with reclaim.

Checking for scrub pause requests is done early at the beginning of the
while loop of scrub_stripe() and later in the loop, scrub_extent() and
scrub_raid56_parity() are called, which in turn call scrub_pages() and
scrub_pages_for_parity() respectively. These last two functions do memory
allocations using GFP_KERNEL. Same problem could happen while scrubbing
the super blocks, since it calls scrub_pages().

We also can not have any of the worker tasks, created by the scrub task,
doing GFP_KERNEL allocations, because before pausing, the scrub task waits
for all the worker tasks to complete (also done at scrub_stripe()).

So make sure GFP_NOFS is used for the memory allocations because at any
time a scrub pause request can happen from another task that started to
commit a transaction.

Fixes: 58c4e17384 ("btrfs: scrub: use GFP_KERNEL on the submission path")
CC: stable@vger.kernel.org # 4.6+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:41 +01:00
Nikolay Borisov 78e62c02ab btrfs: Remove extent_io_ops::readpage_io_failed_hook
For data inodes this hook does nothing but to return -EAGAIN which is
used to signal to the endio routines that this bio belongs to a data
inode. If this is the case the actual retrying is handled by
bio_readpage_error. Alternatively, if this bio belongs to the btree
inode then btree_io_failed_hook just does some cleanup and doesn't retry
anything.

This patch simplifies the code flow by eliminating
readpage_io_failed_hook and instead open-coding btree_io_failed_hook in
end_bio_extent_readpage. Also eliminate some needless checks since IO is
always performed on either data inode or btree inode, both of which are
guaranteed to have their extent_io_tree::ops set.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:41 +01:00
Johannes Thumshirn 7b41ba71c1 btrfs: remove btrfs_bio_end_io_t
The btrfs_bio_end_io_t typedef was introduced with commit
a1d3c4786a ("btrfs: btrfs_multi_bio replaced with btrfs_bio")
but never used anywhere. This commit also introduced a forward declaration
of 'struct btrfs_bio' which is only needed for btrfs_bio_end_io_t.

Remove both as they're not needed anywhere.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:40 +01:00
David Sterba b3a0dd50c3 btrfs: replace btrfs_io_bio::end_io with a simple helper
The end_io callback implemented as btrfs_io_bio_endio_readpage only
calls kfree. Also the callback is set only in case the csum buffer is
allocated and not pointing to the inline buffer. We can use that
information to drop the indirection and call a helper that will free the
csums only in the right case.

This shrinks struct btrfs_io_bio by 8 bytes.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:40 +01:00
David Sterba 31fecccbd7 btrfs: remove redundant csum buffer in btrfs_io_bio
The io_bio tracks checksums and has an inline buffer or an allocated
one. And there's a third member that points to the right one, but we
don't need to use an extra pointer for that. Let btrfs_io_bio::csum
point to the right buffer and check that the inline buffer is not
accidentally freed.

This shrinks struct btrfs_io_bio by 8 bytes.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:40 +01:00
David Sterba 600b6cf468 btrfs: replace async_cow::root with fs_info
The async_cow::root is used to propagate fs_info to async_cow_submit.
We can't use inode to reach it because it could become NULL after
write without compression in async_cow_start.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:40 +01:00
David Sterba 06ea01b1ee btrfs: merge btrfs_submit_bio_done to its caller
There's one caller and its code is simple, we can open code it in
run_one_async_done. The errors are passed through bio.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:40 +01:00
Anand Jain 7333bd02dc btrfs: balance: print to system log when balance ends or is paused
Print a kernel log message when the balance ends, either for cancel or
completed or if it is paused.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:39 +01:00
Anand Jain 56fc37d936 btrfs: balance: print args during start and resume
The information about balance arguments is important for system audit,
this patch prints the textual representation when balance starts or is
resumed.

Example command:

 $ btrfs balance start -f -mprofiles=raid1,convert=single,soft -dlimit=10..20,usage=50 /btrfs

Example kernel log output:

 BTRFS info (device sdb): balance: start -f -dusage=50,limit=10..20 -mconvert=single,soft,profiles=raid1 -sconvert=single,soft,profiles=raid1

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog, simplify code ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:39 +01:00
Anand Jain f89e09cf45 btrfs: add helper to describe block group flags
Factor out helper that describes block group flags from
describe_relocation. The result will not be longer than the given size.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:39 +01:00
Filipe Manana 9a6f209e36 Btrfs: fix deadlock when enabling quotas due to concurrent snapshot creation
If the quota enable and snapshot creation ioctls are called concurrently
we can get into a deadlock where the task enabling quotas will deadlock
on the fs_info->qgroup_ioctl_lock mutex because it attempts to lock it
twice, or the task creating a snapshot tries to commit the transaction
while the task enabling quota waits for the former task to commit the
transaction while holding the mutex. The following time diagrams show how
both cases happen.

First scenario:

           CPU 0                                    CPU 1

 btrfs_ioctl()
  btrfs_ioctl_quota_ctl()
   btrfs_quota_enable()
    mutex_lock(fs_info->qgroup_ioctl_lock)
    btrfs_start_transaction()

                                             btrfs_ioctl()
                                              btrfs_ioctl_snap_create_v2
                                               create_snapshot()
                                                --> adds snapshot to the
                                                    list pending_snapshots
                                                    of the current
                                                    transaction

    btrfs_commit_transaction()
     create_pending_snapshots()
       create_pending_snapshot()
        qgroup_account_snapshot()
         btrfs_qgroup_inherit()
	   mutex_lock(fs_info->qgroup_ioctl_lock)
	    --> deadlock, mutex already locked
	        by this task at
		btrfs_quota_enable()

Second scenario:

           CPU 0                                    CPU 1

 btrfs_ioctl()
  btrfs_ioctl_quota_ctl()
   btrfs_quota_enable()
    mutex_lock(fs_info->qgroup_ioctl_lock)
    btrfs_start_transaction()

                                             btrfs_ioctl()
                                              btrfs_ioctl_snap_create_v2
                                               create_snapshot()
                                                --> adds snapshot to the
                                                    list pending_snapshots
                                                    of the current
                                                    transaction

                                                btrfs_commit_transaction()
                                                 --> waits for task at
                                                     CPU 0 to release
                                                     its transaction
                                                     handle

    btrfs_commit_transaction()
     --> sees another task started
         the transaction commit first
     --> releases its transaction
         handle
     --> waits for the transaction
         commit to be completed by
         the task at CPU 1

                                                 create_pending_snapshot()
                                                  qgroup_account_snapshot()
                                                   btrfs_qgroup_inherit()
                                                    mutex_lock(fs_info->qgroup_ioctl_lock)
                                                     --> deadlock, task at CPU 0
                                                         has the mutex locked but
                                                         it is waiting for us to
                                                         finish the transaction
                                                         commit

So fix this by setting the quota enabled flag in fs_info after committing
the transaction at btrfs_quota_enable(). This ends up serializing quota
enable and snapshot creation as if the snapshot creation happened just
before the quota enable request. The quota rescan task, scheduled after
committing the transaction in btrfs_quote_enable(), will do the accounting.

Fixes: 6426c7ad69 ("btrfs: qgroup: Fix qgroup accounting when creating snapshot")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:39 +01:00