Commit Graph

124 Commits

Author SHA1 Message Date
Chris Mason a1ed835e1a Btrfs: Fix extent replacment race
Data COW means that whenever we write to a file, we replace any old
extent pointers with new ones.  There was a window where a readpage
might find the old extent pointers on disk and cache them in the
extent_map tree in ram in the middle of a given write replacing them.

Even though both the readpage and the write had their respective bytes
in the file locked, the extent readpage inserts may cover more bytes than
it had locked down.

This commit closes the race by keeping the new extent pinned in the extent
map tree until after the on-disk btree is properly setup with the new
extent pointers.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 13:31:07 -04:00
Chris Mason 890871be85 Btrfs: switch extent_map to a rw lock
There are two main users of the extent_map tree.  The
first is regular file inodes, where it is evenly spread
between readers and writers.

The second is the chunk allocation tree, which maps blocks from
logical addresses to phyiscal ones, and it is 99.99% reads.

The mapping tree is a point of lock contention during heavy IO
workloads, so this commit switches things to a rw lock.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-09-11 13:31:05 -04:00
Christoph Hellwig 9601e3f633 Btrfs: kill btrfs_cache_create
Just use kmem_cache_create directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-24 15:46:04 -04:00
Christoph Hellwig 0d4bf11e53 Btrfs: don't export symbols
Currently the extent_map code is only for btrfs so don't export it's
symbols.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-24 15:46:04 -04:00
Dan Carpenter ff0a5836ac Btrfs: remove dead code
merge is always NULL at this point.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-02 16:46:06 -04:00
Huang Weiyi 7eaebe7d50 Btrfs: removed unused #include <version.h>'s
Removed unused #include <version.h>'s in btrfs

Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 10:49:16 -05:00
Chris Mason d397712bcc Btrfs: Fix checkpatch.pl warnings
There were many, most are fixed now.  struct-funcs.c generates some warnings
but these are bogus.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-05 21:25:51 -05:00
Chris Mason c8b978188c Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents.  It does some fairly large
surgery to the writeback paths.

Compression is off by default and enabled by mount -o compress.  Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.

If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.

* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler.  This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.

* Inline extents are inserted at delalloc time now.  This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.

* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.

From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field.  Neither the encryption or the
'other' field are currently used.

In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k.  This is a
software only limit, the disk format supports u64 sized compressed extents.

In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k.  This is a software only limit
and will be subject to tuning later.

Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data.  This way additional encodings can be
layered on without having to figure out which encoding to checksum.

Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread.  This makes it tricky to
spread the compression load across all the cpus on the box.  We'll have to
look at parallel pdflush walks of dirty inodes at a later time.

Decompression is hooked into readpages and it does spread across CPUs nicely.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 14:49:59 -04:00
Chris Mason d352ac6814 Btrfs: add and improve comments
This improves the comments at the top of many functions.  It didn't
dive into the guts of functions because I was trying to
avoid merging problems with the new allocator and back reference work.

extent-tree.c and volumes.c were both skipped, and there is definitely
more work todo in cleaning and commenting the code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-29 15:18:18 -04:00
Chris Mason 7c2fe32a23 Btrfs: Fix add_extent_mapping to check for duplicates across the whole range
add_extent_mapping was allowing the insertion of overlapping extents.
This never used to happen because it only inserted the extents from disk
and those were never overlapping.

But, with the data=ordered code, the disk and memory representations of the
file are not the same.  add_extent_mapping needs to ensure a new extent
does not overlap before it inserts.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
David Woodhouse 64f26f7450 Btrfs: Use assert_spin_locked instead of spin_trylock
On UP systems spin_trylock always succeeds

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason f421950f86 Btrfs: Fix some data=ordered related data corruptions
Stress testing was showing data checksum errors, most of which were caused
by a lookup bug in the extent_map tree.  The tree was caching the last
pointer returned, and searches would check the last pointer first.

But, search callers also expect the search to return the very first
matching extent in the range, which wasn't always true with the last
pointer usage.

For now, the code to cache the last return value is just removed.  It is
easy to fix, but I think lookups are rare enough that it isn't required anymore.

This commit also replaces do_sync_mapping_range with a local copy of the
related functions.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason 7f3c74fb83 Btrfs: Keep extent mappings in ram until pending ordered extents are done
It was possible for stale mappings from disk to be used instead of the
new pending ordered extent.  This adds a flag to the extent map struct
to keep it pinned until the pending ordered extent is actually on disk.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason e6dcd2dc9c Btrfs: New data=ordered implementation
The old data=ordered code would force commit to wait until
all the data extents from the transaction were fully on disk.  This
introduced large latencies into the commit and stalled new writers
in the transaction for a long time.

The new code changes the way data allocations and extents work:

* When delayed allocation is filled, data extents are reserved, and
  the extent bit EXTENT_ORDERED is set on the entire range of the extent.
  A struct btrfs_ordered_extent is allocated an inserted into a per-inode
  rbtree to track the pending extents.

* As each page is written EXTENT_ORDERED is cleared on the bytes corresponding
  to that page.

* When all of the bytes corresponding to a single struct btrfs_ordered_extent
  are written, The previously reserved extent is inserted into the FS
  btree and into the extent allocation trees.  The checksums for the file
  data are also updated.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:04 -04:00
Christoph Hellwig 9d2423c5c3 Btrfs: kerneldoc comments for extent_map.c
Add kerneldoc comments for all exported functions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Christoph Hellwig 306929f364 btrfs: fix strange indentation in lookup_extent_mapping
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason d1310b2e0c Btrfs: Split the extent_map code into two parts
There is now extent_map for mapping offsets in the file to disk and
extent_io for state tracking, IO submission and extent_bufers.

The new extent_map code shifts from [start,end] pairs to [start,len], and
pushes the locking out into the caller.  This allows a few performance
optimizations and is easier to use.

A number of extent_map usage bugs were fixed, mostly with failing
to remove extent_map entries when changing the file.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason 5f56406aab Btrfs: Fix hole insertion corner cases
There were a few places that could cause duplicate extent insertion,
this adjusts the code that creates holes to avoid it.

lookup_extent_map is changed to correctly return all of the extents in a
range, even when there are none matching at the start of the range.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Yan f0c5da1446 Btrfs: Fix for test_range_bit
test_range_bit doesn't properly handle the case: there's a hole at the
end of the range and there's no other extent_state after the range.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason b3a0d8d28c Btrfs: Remove verbose WARN_ON
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason 55c69072d6 Btrfs: Fix extent_buffer usage when nodesize != leafsize
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason bcd987feef Btrfs: Remove extent_map debugging message
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason 5d4fb734b4 Btrfs: Fix an off by one in the extent_map prepare write code
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 1832a6d5ee Btrfs: Implement basic support for -ENOSPC
This is intended to prevent accidentally filling the drive.  A determined
user can still make things oops.

It includes some accounting of the current bytes under delayed allocation,
but this will change as things get optimized

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 190662b212 Btrfs: Fix delayed allocation to avoid missing delalloc extents
find_lock_delalloc_range could exit out too early

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 6da6abae02 Btrfs: Back port to 2.6.18-el kernels
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Christian Hesse 17636e03f4 Btrfs: section mismatch warnings
--Boundary-00=_CcOWHFYK4T+JwSj
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Hello everybody,

compiling btrfs into the kernel results in section mismatch warnings. __exit
functions are called where they are not allowed to. The attached patch fixes
this for me. Not sure if it is correct though.

Signed-off-by: Christian Hesse <mail@earthworm.de>
--
Regards,
Chris

--Boundary-00=_CcOWHFYK4T+JwSj
Content-Type: text/x-diff; charset="iso-8859-1";
	name="btrfs-section_mismatches.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
	filename="btrfs-section_mismatches.patch"

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason ca6646264b Btrfs: Add efficient dirty accounting to the extent_map tree
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 793955bca6 Btrfs: Limit btree writeback to prevent seeks
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 015a739c7c Btrfs: Handle writeback under high memory pressure better
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 7073c8e852 Btrfs: Make sure page mapping dirty tag is properly cleared
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 3e9fd94ff0 Btrfs: Avoid fragmentation from parallel delalloc filling
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Wyatt Banks 2f4cbe6442 Btrfs: Return value checking in module init
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 0591fb56fb Btrfs: Fix extent bit range testing
It could return the bit as set when there was actually a hole at the
very end of the range.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason 3ab2fb5a8c Btrfs: Add readpages support
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason 856bf3e592 Btrfs: Avoid extent_buffer lru corruption
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason 09be207d1b Btrfs: Fix failure cleanups when allocating extent buffers fail
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason b293f02e14 Btrfs: Add writepages support
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Yan 944746ec75 Btrfs: small fixes for find_lock_delalloc_range.
There is a 'finish_wait', but no 'prepare_to_wait' . So I think that
the 'prepare_to_wait' is missing. The second change is  according to
the name of variable.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason 179e29e488 Btrfs: Fix a number of inline extent problems that Yan Zheng reported.
The fixes do a number of things:

1) Most btrfs_drop_extent callers will try to leave the inline extents in
place.  It can truncate bytes off the beginning of the inline extent if
required.

2) writepage can now update the inline extent, allowing mmap writes to
go directly into the inline extent.

3) btrfs_truncate_in_transaction truncates inline extents

4) extent_map.c fixed to not merge inline extent mappings and hole
mappings together

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason 35ebb934bd Btrfs: Fix PAGE_CACHE_SHIFT shifts on 32 bit machines
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Yan c67cda1758 Btrfs: Fix extent_map leak in extent_bmap
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Yan 65555a06b4 Btrfs: Off by one fixes in extent_map.c
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason ff190c0c00 Btrfs: Avoid recursive KM_USER1 mappings in copy_extent_buffer
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason 3685f79165 Btrfs: CPU usage optimizations in push and the extent_map code
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason 59d169e2b3 Btrfs: Fix read/write_extent_buffer to use KM_USER1 instead of KM_USER0
This avoids recursive use of KM_USER0 during btrfs_file_write

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Jens Axboe 0a2118dfd4 Btrfs: Fix bi_end_io() functions on > 2.6.23 kernels
It now returns void and it is never called for partial completions, so
the bio->bi_size check must go.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Jens Axboe ae2f5411c4 btrfs: 32-bit type problems
An assorted set of casts to get rid of the warnings on 32-bit archs.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason ff79f8190b Btrfs: Add back file data checksumming
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason 19c00ddcc3 Btrfs: Add back metadata checksumming
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason 810191ff30 Btrfs: extent_map optimizations to cut down on CPU usage
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason 4dc119046d Btrfs: Add an extent buffer LRU to reduce radix tree hits
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason e19caa5f0e Btrfs: Fix allocation routines to avoid intermixing data and metadata allocations
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason 09e71a3263 Btrfs: Use an array of pages in the extent buffers to reduce the cost of find_get_page
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason 14048ed0c4 Btrfs: Cache extent buffer mappings
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason db94535db7 Btrfs: Allow tree blocks larger than the page size
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason 1a5bc167f6 Btrfs: Change the remaining radix trees used by extent-tree.c to extent_map trees
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason 96b5179d0d Btrfs: Stop using radix trees for the block group cache
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason f510cfecfc Btrfs: Fix extent_buffer and extent_state leaks
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason ae5252bd51 Btrfs: Go back to kmaps instead of page_address in extent_buffers
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason 6d36dcd48f Btrfs: Avoid memcpy where possible in extent_buffers
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason 479965d66e Btrfs: Optimizations for the extent_buffer code
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason 5f39d397df Btrfs: Create extent_buffer interface for large blocksizes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Christoph Hellwig b3cfa35a49 Btrfs: factor page private preparations into a helper
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Christoph Hellwig 0e2752a72c Btrfs: [PATCH] extent_map: add writepage_end_io hook
XFS updates the ondisk inode size only after the data I/O has finished,
so it needs a hook when the writepage end_bio handler has finished.

Might not be worth applying as-is as the per-page callback is very
ineffcient.  What XFS really wants is a callback when writeout of a
whole extent has completed.  This delayed i_size updates scheme might
be worthwile for btrfs aswell, btw.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-09-10 20:02:33 -04:00
Christoph Hellwig b06355f0fe Btrfs: [PATCH] extent_map: make the writepage_io hook optional
The writepage_io is not mandatory, e.g. my port of xfs to the extent_map
code does not have one for now.  So handle a NULL pointer gracefully
here.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-09-10 20:02:32 -04:00
Christoph Hellwig d396c6f554 Btrfs: [PATCH] extent_map: provide generic bmap
generic_bmap is completely trivial, while the extent to bh mapping in
btrfs is rather complex.  So provide a extent_bmap instead that takes
a get_extent callback and can be used by filesystem using the extent_map
code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-09-10 20:02:30 -04:00
Christoph Hellwig 90f1c19a9f Btrfs: [PATCH] extent_map: fix locking for bio completion
The bio completion handlers can be run in any context, e.g. when using
the old ide driver they run in hardirq context with irqs disabled so
lockdep rightfully warns about using write_lock_irq useage in these
handlers.

This patch switches clear_extent_bit and set_extent_bit to
write_lock_irqsave to fix this problem.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-09-10 20:02:27 -04:00
Chris Mason a8c450b211 Btrfs: Reorder tests in set_extent_bit to properly find holes
Yan Zheng noticed that set_extent_bit was exiting too early when there
was a hole in the map.  The fix is to reorder the tests to check for the
hole first.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-09-10 20:00:27 -04:00
Chris Mason 86479a04ee Add support for defragging files via btrfsctl -d. Avoid OOM on extent tree
defrag.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-09-10 19:58:16 -04:00
Chris Mason 2bf5a725a3 Btrfs: fsx delalloc fixes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-30 11:54:02 -04:00
Chris Mason 07157aacb1 Btrfs: Add file data csums back in via hooks in the extent map code
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-30 08:50:51 -04:00
Chris Mason b888db2bd7 Btrfs: Add delayed allocation to the extent based page tree code
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-27 16:49:44 -04:00
Chris Mason a52d9a8033 Btrfs: Extent based page cache code. This uses an rbtree of extents and tests
instead of buffer heads.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-27 16:49:44 -04:00