Commit Graph

841854 Commits

Author SHA1 Message Date
Filipe Manana d1d832a0b5 Btrfs: fix data loss after inode eviction, renaming it, and fsync it
When we log an inode, regardless of logging it completely or only that it
exists, we always update it as logged (logged_trans and last_log_commit
fields of the inode are updated). This is generally fine and avoids future
attempts to log it from having to do repeated work that brings no value.

However, if we write data to a file, then evict its inode after all the
dealloc was flushed (and ordered extents completed), rename the file and
fsync it, we end up not logging the new extents, since the rename may
result in logging that the inode exists in case the parent directory was
logged before. The following reproducer shows and explains how this can
happen:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir /mnt/dir
  $ touch /mnt/dir/foo
  $ touch /mnt/dir/bar

  # Do a direct IO write instead of a buffered write because with a
  # buffered write we would need to make sure dealloc gets flushed and
  # complete before we do the inode eviction later, and we can not do that
  # from user space with call to things such as sync(2) since that results
  # in a transaction commit as well.
  $ xfs_io -d -c "pwrite -S 0xd3 0 4K" /mnt/dir/bar

  # Keep the directory dir in use while we evict inodes. We want our file
  # bar's inode to be evicted but we don't want our directory's inode to
  # be evicted (if it were evicted too, we would not be able to reproduce
  # the issue since the first fsync below, of file foo, would result in a
  # transaction commit.
  $ ( cd /mnt/dir; while true; do :; done ) &
  $ pid=$!

  # Wait a bit to give time for the background process to chdir.
  $ sleep 0.1

  # Evict all inodes, except the inode for the directory dir because it is
  # currently in use by our background process.
  $ echo 2 > /proc/sys/vm/drop_caches

  # fsync file foo, which ends up persisting information about the parent
  # directory because it is a new inode.
  $ xfs_io -c fsync /mnt/dir/foo

  # Rename bar, this results in logging that this inode exists (inode item,
  # names, xattrs) because the parent directory is in the log.
  $ mv /mnt/dir/bar /mnt/dir/baz

  # Now fsync baz, which ends up doing absolutely nothing because of the
  # rename operation which logged that the inode exists only.
  $ xfs_io -c fsync /mnt/dir/baz

  <power failure>

  $ mount /dev/sdb /mnt
  $ od -t x1 -A d /mnt/dir/baz
  0000000

    --> Empty file, data we wrote is missing.

Fix this by not updating last_sub_trans of an inode when we are logging
only that it exists and the inode was not yet logged since it was loaded
from disk (full_sync bit set), this is enough to make btrfs_inode_in_log()
return false for this scenario and make us log the inode. The logged_trans
of the inode is still always setsince that alone is used to track if names
need to be deleted as part of unlink operations.

Fixes: 257c62e1bc ("Btrfs: avoid tree log commit when there are no changes")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:48 +02:00
David Sterba 6d58a55a89 btrfs: raid56: clear incompat block group flags after removing the last one
The incompat bit for RAID56 is set either at mount time or automatically
when the profile is used by balance. The part where the bit is removed
is missing and can be unexpected or undesired when an older kernel is
needed.

This patch will drop the incompat bit after this command, assuming
that RAID5 profile is not used by system or metadata:

 $ btrfs balance start -dconvert=raid5 /mnt
 $ btrfs balance start -dconvert=raid1 /mnt

This will print "clearing 128 feature flag" to the system log.

The patch is safe for backporting to older kernels.

Reported-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:48 +02:00
David Sterba 00801ae4bb btrfs: switch extent_buffer write_locks from atomic to int
The write_locks is either 0 or 1 and always updated under the lock,
so we don't need the atomic_t semantics.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:47 +02:00
David Sterba f3dc24c52a btrfs: switch extent_buffer spinning_writers from atomic to int
The spinning_writers is either 0 or 1 and always updated under the lock,
so we don't need the atomic_t semantics.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:47 +02:00
David Sterba 06297d8cef btrfs: switch extent_buffer blocking_writers from atomic to int
The blocking_writers is either 0 or 1 and always updated under the lock,
so we don't need the atomic_t semantics.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:47 +02:00
David Sterba 38e9372e39 btrfs: assert delayed ref lock in btrfs_find_delayed_ref_head
Turn the comment about required lock into an assertion.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:47 +02:00
David Sterba 93ead46b03 btrfs: tests: add locks around add_extent_mapping
There are no concerns about locking during the selftests so the locks
are not necessary, but following patches will add lockdep assertions to
add_extent_mapping so this is needed in tests too.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:35:03 +02:00
Nikolay Borisov 8666e638b0 btrfs: Document __etree_search
The function has a lot of return values and specific conventions making
it cumbersome to understand what's returned. Have a go at documenting
its parameters and return values.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:35:03 +02:00
Nikolay Borisov 1eaebb341d btrfs: Don't trim returned range based on input value in find_first_clear_extent_bit
Currently find_first_clear_extent_bit always returns a range whose
starting value is >= passed 'start'. This implicit trimming behavior is
somewhat subtle and an implementation detail.

Instead, this patch modifies the function such that now it always
returns the range which contains passed 'start' and has the given bits
unset. This range could either be due to presence of existing records
which contains 'start' but have the bits unset or because there are no
records that contain the given starting offset.

This patch also adds test cases which cover find_first_clear_extent_bit
since they were missing up until now.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:35:02 +02:00
Nikolay Borisov 53460a4572 btrfs: trim: make reserved device area adjustments more explicit
Currently the first megabyte on a device housing a btrfs filesystem is
exempt from allocation and trimming. Currently this is not a problem
since 'start' is set to 1M at the beginning of btrfs_trim_free_extents
and find_first_clear_extent_bit always returns a range that is >= start.

However, in a follow up patch find_first_clear_extent_bit will be
changed such that it will return a range containing 'start' and this
range may very well be 0...>=1M so 'start'.

Future proof the sole user of find_first_clear_extent_bit by setting
'start' after the function is called. No functional changes.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:35:02 +02:00
David Sterba 6f8e4fd430 btrfs: use file:line format for assertion report
The filename:line format is commonly understood by editors and can be
copy&pasted more easily than the current format.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:35:02 +02:00
Johannes Thumshirn ea41d6b278 btrfs: remove assumption about csum type form btrfs_print_data_csum_error()
btrfs_print_data_csum_error() still assumed checksums to be 32 bit in
size.  Make it size agnostic.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:35:02 +02:00
Johannes Thumshirn d5178578bc btrfs: directly call into crypto framework for checksumming
Currently btrfs_csum_data() relied on the crc32c() wrapper around the
crypto framework for calculating the CRCs.

As we have our own crypto_shash structure in the fs_info now, we can
directly call into the crypto framework without going trough the wrapper.

This way we can even remove the btrfs_csum_data() and btrfs_csum_final()
wrappers.

The module dependency on crc32c is preserved via MODULE_SOFTDEP("pre:
crc32c"), which was previously provided by LIBCRC32C config option doing
the same.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:35:02 +02:00
Johannes Thumshirn 6d97c6e31b btrfs: add boilerplate code for directly including the crypto framework
Add boilerplate code for directly including the crypto framework.  This
helps us flipping the switch for new algorithms.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:35:01 +02:00
Johannes Thumshirn 51bce6c9b9 btrfs: Simplify btrfs_check_super_csum() and get rid of size assumptions
Now that we have already checked for a valid checksum type before
calling btrfs_check_super_csum(), it can be simplified even further.

While at it get rid of the implicit size assumption of the resulting
checksum as well.

This is a preparation for changing all checksum functionality to use the
crypto layer later.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:35:01 +02:00
Johannes Thumshirn 8dc3f22c8b btrfs: check for supported superblock checksum type before checksum validation
Now that we have factorerd out the superblock checksum type validation,
we can check for supported superblock checksum types before doing the
actual validation of the superblock read from disk.

This leads the path to further simplifications of
btrfs_check_super_csum() later on.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:35:01 +02:00
Johannes Thumshirn e7e16f4882 btrfs: add common checksum type validation
Currently btrfs is only supporting CRC32C as checksumming algorithm. As
this is about to change provide a function to validate the checksum type
in the superblock against all possible algorithms.

This makes adding new algorithms easier as there are fewer places to
adjust when adding new algorithms.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:35:01 +02:00
Johannes Thumshirn 7ebc7e5f2c btrfs: format checksums according to type for printing
Add a small helper for btrfs_print_data_csum_error() which formats the
checksum according to it's type for pretty printing.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ shorten macro name ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:35:01 +02:00
Johannes Thumshirn 10fe6ca80d btrfs: don't assume compressed_bio sums to be 4 bytes
BTRFS has the implicit assumption that a checksum in compressed_bio is 4
bytes. While this is true for CRC32C, it is not for any other checksum.

Change the data type to be a byte array and adjust loop index calculation
accordingly.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:35:01 +02:00
Johannes Thumshirn 1e25a2e3ca btrfs: don't assume ordered sums to be 4 bytes
BTRFS has the implicit assumption that a checksum in btrfs_orderd_sums
is 4 bytes. While this is true for CRC32C, it is not for any other
checksum.

Change the data type to be a byte array and adjust loop index
calculation accordingly.

This includes moving the adjustment of 'index' by 'ins_size' in
btrfs_csum_file_blocks() before dividing 'ins_size' by the checksum
size, because before this patch the 'sums' member of 'struct
btrfs_ordered_sum' was 4 Bytes in size and afterwards it is only one
byte.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:35:00 +02:00
Johannes Thumshirn 4bb3c2e2b5 btrfs: use btrfs_crc32c{,_final}() in for free space cache
The CRC checksum in the free space cache is not dependant on the super
block's csum_type field but always a CRC32C.

So use btrfs_crc32c() and btrfs_crc32c_final() instead of
btrfs_csum_data() and btrfs_csum_final() for computing these checksums.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:35:00 +02:00
Johannes Thumshirn 65019df8c3 btrfs: resurrect btrfs_crc32c()
Commit 9678c54388 ("btrfs: Remove custom crc32c init code") removed
the btrfs_crc32c() function, because it was a duplicate of the crc32c()
library function we already have in the kernel.

Resurrect it as a shim wrapper over crc32c() to make following
transformations of the checksumming code in btrfs easier.

Also provide a btrfs_crc32_final() to ease following transformations.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:35:00 +02:00
Johannes Thumshirn 5852c8b961 btrfs: use btrfs_csum_data() instead of directly calling crc32c
btrfsic_test_for_metadata() directly calls the crc32c() library function
for calculating the CRC32C checksum, but then uses btrfs_csum_final() to
invert the result.

To ease further refactoring and development around checksumming in BTRFS
convert to calling btrfs_csum_data(), which is a wrapper around
crc32c().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:35:00 +02:00
Qu Wenruo a94d1d0cb3 btrfs: Flush before reflinking any extent to prevent NOCOW write falling back to COW without data reservation
[BUG]
The following script can cause unexpected fsync failure:

  #!/bin/bash

  dev=/dev/test/test
  mnt=/mnt/btrfs

  mkfs.btrfs -f $dev -b 512M > /dev/null
  mount $dev $mnt -o nospace_cache

  # Prealloc one extent
  xfs_io -f -c "falloc 8k 64m" $mnt/file1
  # Fill the remaining data space
  xfs_io -f -c "pwrite 0 -b 4k 512M" $mnt/padding
  sync

  # Write into the prealloc extent
  xfs_io -c "pwrite 1m 16m" $mnt/file1

  # Reflink then fsync, fsync would fail due to ENOSPC
  xfs_io -c "reflink $mnt/file1 8k 0 4k" -c "fsync" $mnt/file1
  umount $dev

The fsync fails with ENOSPC, and the last page of the buffered write is
lost.

[CAUSE]
This is caused by:
- Btrfs' back reference only has extent level granularity
  So write into shared extent must be COWed even only part of the extent
  is shared.

So for above script we have:
- fallocate
  Create a preallocated extent where we can do NOCOW write.

- fill all the remaining data and unallocated space

- buffered write into preallocated space
  As we have not enough space available for data and the extent is not
  shared (yet) we fall into NOCOW mode.

- reflink
  Now part of the large preallocated extent is shared, later write
  into that extent must be COWed.

- fsync triggers writeback
  But now the extent is shared and therefore we must fallback into COW
  mode, which fails with ENOSPC since there's not enough space to
  allocate data extents.

[WORKAROUND]
The workaround is to ensure any buffered write in the related extents
(not just the reflink source range) get flushed before reflink/dedupe,
so that NOCOW writes succeed that happened before reflinking succeed.

The workaround is expensive, we could do it better by only flushing
NOCOW range, but that needs extra accounting for NOCOW range.
For now, fix the possible data loss first.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:35:00 +02:00
Nikolay Borisov 5f791ec31f btrfs: Return EAGAIN if we can't start no snpashot write in check_can_nocow
The first thing code does in check_can_nocow is trying to block
concurrent snapshots. If this fails (due to snpashot already being in
progress) the function returns ENOSPC which makes no sense. Instead
return EAGAIN. Despite this return value not being propagated to callers
it's good practice to return the closest in terms of semantics error
code. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:59 +02:00
Nikolay Borisov 0b6f5d408b btrfs: Add comments on locking of several device-related fields
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:59 +02:00
Nikolay Borisov bd80d94efb btrfs: Always use a cached extent_state in btrfs_lock_and_flush_ordered_range
In case no cached_state argument is passed to
btrfs_lock_and_flush_ordered_range use one locally in the function. This
optimises the case when an ordered extent is found since the unlock
function will be able to unlock that state directly without searching
for it again.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:59 +02:00
Nikolay Borisov 23d31bd476 btrfs: Use newly introduced btrfs_lock_and_flush_ordered_range
There several functions which open code
btrfs_lock_and_flush_ordered_range, just replace them with a call to the
function. No functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:59 +02:00
Nikolay Borisov ffa87214c1 btrfs: add new helper btrfs_lock_and_flush_ordered_range
There is a certain idiom used in multiple places in btrfs' codebase,
dealing with flushing an ordered range. Factor this in a separate
function that can be reused. Future patches will replace the existing
code with that function.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:59 +02:00
Qu Wenruo 1200b51f57 btrfs: remove the incorrect comment on RO fs when btrfs_run_delalloc_range() fails
At the context of btrfs_run_delalloc_range(), we haven't started/joined
a transaction, thus even something went wrong, we can't and won't abort
transaction, thus no way to make the fs RO.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:59 +02:00
Qu Wenruo 480b9b4d84 btrfs: extent-tree: Add trace events for space info numbers update
Add trace event for update_bytes_pinned() and update_bytes_may_use() to
detect underflow better.

The output would be something like (only showing data part):

  ## Buffered write start, 16K total ##
  2255.954 xfs_io/860 btrfs:update_bytes_may_use:(nil)U: type=DATA old=0 diff=4096
  2257.169 sudo/860 btrfs:update_bytes_may_use:(nil)U: type=DATA old=4096 diff=4096
  2257.346 sudo/860 btrfs:update_bytes_may_use:(nil)U: type=DATA old=8192 diff=4096
  2257.542 sudo/860 btrfs:update_bytes_may_use:(nil)U: type=DATA old=12288 diff=4096

  ## Delalloc start ##
  3727.853 kworker/u8:3-e/700 btrfs:update_bytes_may_use:(nil)U: type=DATA old=16384 diff=-16384

  ## Space cache update ##
  3733.132 sudo/862 btrfs:update_bytes_may_use:(nil)U: type=DATA old=0 diff=65536
  3733.169 sudo/862 btrfs:update_bytes_may_use:(nil)U: type=DATA old=65536 diff=-65536
  3739.868 sudo/862 btrfs:update_bytes_may_use:(nil)U: type=DATA old=0 diff=65536
  3739.891 sudo/862 btrfs:update_bytes_may_use:(nil)U: type=DATA old=65536 diff=-65536

These two trace events will allow bcc tool to probe btrfs_space_info
changes and detect underflow with more details (e.g. backtrace for each
update).

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:58 +02:00
Qu Wenruo 0185f364cb btrfs: extent-tree: Add lockdep assert when updating space info
Just add a safe net for btrfs_space_info member updating.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:58 +02:00
David Sterba cff8267228 btrfs: read number of data stripes from map only once
There are several places that call nr_data_stripes, but this value does
not change.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:58 +02:00
David Sterba 72ad813157 btrfs: constify map parameter for nr_parity_stripes and nr_data_stripes
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:58 +02:00
David Sterba 158da513b1 btrfs: refactor helper for bg flags to name conversion
The helper lacks the btrfs_ prefix and the parameter is the raw
blockgroup type, so none of the callers has to do the flags -> index
conversion.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:58 +02:00
David Sterba e3ecdb3fde btrfs: factor out devs_max setting in __btrfs_alloc_chunk
Merge the repeated code before the if-else block.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:57 +02:00
David Sterba 8c3e3582a4 btrfs: use u8 for raid_array members
The raid_attr table is now 7 * 56 = 392 bytes long, consisting of just
small numbers so we don't have to use ints. New size is 7 * 32 = 224,
saving 3 cachelines.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:57 +02:00
David Sterba 946c9256c6 btrfs: factor out helper for counting data stripes
Factor the sequence of ifs to a helper, the 'data stripes' here means
the number of stripes without redundancy and parity.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:57 +02:00
David Sterba 44b28adafd btrfs: use raid_attr table for btrfs_bg_type_to_factor
The factor is the number of copies.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:57 +02:00
David Sterba 6079e12cdb btrfs: use raid_attr table to find profiles for integrity lowering
Replace open coded list of the profiles by selecting them from the
raid_attr table. The criteria are now more explicit, we need profiles
that have more than 1 copy of the data or can reconstruct the data with
a missing device.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:57 +02:00
David Sterba 081db89b13 btrfs: use raid_attr to get allowed profiles for balance conversion
Iterate over the table and gather all allowed profiles for a given
number of devices, instead of open coding.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:56 +02:00
David Sterba fc9a2ac77c btrfs: use raid_attr in btrfs_chunk_max_errors
The number of tolerated failures is stored in the raid_attr table, use
it.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:56 +02:00
David Sterba 9fa02ac75b btrfs: use raid_attr table in get_profile_num_devs
The dev_max constraints are defined in the raid_attr table, use it
instead of open-coding it.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:56 +02:00
David Sterba c8bf1b6703 btrfs: remove mapping tree structures indirection
fs_info::mapping_tree is the physical<->logical mapping tree and uses
the same underlying structure as extents, but is embedded to another
structure. There are no other members and this indirection is useless.
No functional change.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:56 +02:00
David Sterba 49cc180ca9 btrfs: raid56: allow the exact minimum number of devices for balance convert
The minimum number of devices for RAID5 is 2, though this is only a bit
expensive RAID1, and for RAID6 it's 3, which is a triple copy that works
only 3 devices.

mkfs.btrfs allows that and mounting such filesystem also works, so the
conversion via balance filters is inconsistent with the others and we
should not prevent it.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:56 +02:00
David Sterba 0ee5f8ae08 btrfs: fix minimum number of chunk errors for DUP
The list of profiles in btrfs_chunk_max_errors lists DUP as a profile
DUP able to tolerate 1 device missing. Though this profile is special
with 2 copies, it still needs the device, unlike the others.

Looking at the history of changes, thre's no clear reason why DUP is
there, functions were refactored and blocks of code merged to one
helper.

d20983b40e Btrfs: fix writing data into the seed filesystem
  - factor code to a helper

de11cc12df Btrfs: don't pre-allocate btrfs bio
  - unrelated change, DUP still in the list with max errors 1

a236aed14c Btrfs: Deal with failed writes in mirrored configurations
  - introduced the max errors, leaves DUP and RAID1 in the same group

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:55 +02:00
Liu Bo be9b8dfa9c Btrfs: remove unused variables in __btrfs_unlink_inode
This code was first introduced in 5f39d397df ("Btrfs: Create
extent_buffer interface for large blocksizes") and the function was
named btrfs_unlink_trans. It later got renamed to __btrfs_unlink_inode
and finally commit 16cdcec736 ("btrfs: implement delayed inode items
operation") changed the way inodes are deleted and obviated the need for
those two members.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ replace changelog by Nikolay's version ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:55 +02:00
Goldwyn Rodrigues cebf05ca65 btrfs: Remove unused variable mode in btrfs_mount
This is a leftover from 312c89fbca ("btrfs: cleanup btrfs_mount()
using btrfs_mount_root()"), the mode was used for opening devices that's
not done here anymore.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:55 +02:00
Su Yue 8f63a84051 btrfs: switch order of unlocks of space_info and bg in do_trimming()
In function do_trimming(), block_group->lock should be unlocked first,
as the locks should be released in the reverse order. This does not
cause problems but should follow the best practices.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:55 +02:00
Qu Wenruo 4c094c33c9 btrfs: tree-checker: Check if the file extent end overflows
Under certain conditions, we could have strange file extent item in log
tree like:

  item 18 key (69599 108 397312) itemoff 15208 itemsize 53
	extent data disk bytenr 0 nr 0
	extent data offset 0 nr 18446744073709547520 ram 18446744073709547520

The num_bytes + ram_bytes overflow 64 bit type.

For num_bytes part, we can detect such overflow along with file offset
(key->offset), as file_offset + num_bytes should never go beyond u64.

For ram_bytes part, it's about the decompressed size of the extent, not
directly related to the size.
In theory it is OK to have a large value, and put extra limitation
on RAM bytes may cause unexpected false alerts.

So in tree-checker, we only check if the file offset and num bytes
overflow.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:34:55 +02:00