Merge git://git.kernel.org/pub/scm/linux/kernel/git/joern/logfs

* git://git.kernel.org/pub/scm/linux/kernel/git/joern/logfs:
  [LogFS] Change magic number
  [LogFS] Remove h_version field
  [LogFS] Check feature flags
  [LogFS] Only write journal if dirty
  [LogFS] Fix bdev erases
  [LogFS] Silence gcc
  [LogFS] Prevent 64bit divisions in hash_index
  [LogFS] Plug memory leak on error paths
  [LogFS] Add MAINTAINERS entry
  [LogFS] add new flash file system

Fixed up trivial conflict in lib/Kconfig, and a semantic conflict in
fs/logfs/inode.c introduced by write_inode() being changed to use
'writeback_control' by commit a9185b41a4
("pass writeback_control to ->write_inode")
Linus Torvalds 2010-03-06 13:18:03 -08:00
commit 66b89159c2
26 changed files with 10554 additions and 0 deletions

@@ -62,6 +62,8 @@ jfs.txt
- info and mount options for the JFS filesystem.
locks.txt
- info on file locking implementations, flock() vs. fcntl(), etc.
logfs.txt
- info on the LogFS flash filesystem.
mandatory-locking.txt
- info on the Linux implementation of Sys V mandatory file locking.
ncpfs.txt

@@ -0,0 +1,241 @@
The LogFS Flash Filesystem
==========================
Specification
=============
Superblocks
-----------
Two superblocks exist at the beginning and end of the filesystem.
Each superblock is 256 Bytes large, with another 3840 Bytes reserved
for future purposes, making a total of 4096 Bytes.
Superblock locations may differ for MTD and block devices. On MTD the
first non-bad block contains a superblock in the first 4096 Bytes and
the last non-bad block contains a superblock in the last 4096 Bytes.
On block devices, the first 4096 Bytes of the device contain the first
superblock and the last aligned 4096 Byte-block contains the second
superblock.
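As a rough sketch of that arithmetic (the variable name dev_size is purely
illustrative; bdev_find_last_sb() in fs/logfs/dev_bdev.c below does the same
computation on the block device's size):

	u64 first_sb_ofs = 0;                               /* first 4096 bytes */
	u64 last_sb_ofs = (dev_size & ~0xfffULL) - 0x1000;  /* last aligned 4096-byte block */
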
For the most part, the superblocks can be considered read-only. They
are written only to correct errors detected within the superblocks,
move the journal and change the filesystem parameters through tunefs.
As a result, the superblock does not contain any fields that require
constant updates, like the amount of free space, etc.
Segments
--------
The space in the device is split up into equal-sized segments.
Segments are the primary write unit of LogFS. Within each segment,
writes happen from front (low addresses) to back (high addresses). If
only a partial segment has been written, the segment number, the
current position within and optionally a write buffer are stored in
the journal.
Segments are erased as a whole. Therefore Garbage Collection may be
required to completely free a segment before doing so.
Journal
--------
The journal contains all global information about the filesystem that
is subject to frequent change. At mount time, it has to be scanned
for the most recent commit entry, which contains a list of pointers to
all currently valid entries.
Object Store
------------
All space except for the superblocks and journal is part of the object
store. Each segment contains a segment header and a number of
objects, each consisting of the object header and the payload.
Objects are either inodes, directory entries (dentries), file data
blocks or indirect blocks.
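As a sketch, the scan over such a segment looks roughly like this (mirroring
the loop in logfs_gc_segment() in fs/logfs/gc.c below; declarations, error
handling and the check for blank headers at the end of the segment are
omitted):

	u32 seg_ofs = LOGFS_SEGMENT_HEADERSIZE;	/* skip the segment header */
	while (seg_ofs + sizeof(oh) < super->s_segsize) {
		wbuf_read(sb, dev_ofs(sb, segno, seg_ofs), sizeof(oh), &oh);
		/* oh.ino and oh.bix identify the object, oh.len its payload size */
		seg_ofs += sizeof(oh) + be16_to_cpu(oh.len);
	}
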
Levels
------
Garbage collection (GC) may fail if all data is written
indiscriminately. One requirement of GC is that data is separated
roughly according to the distance between the tree root and the data.
Effectively that means all file data is on level 0, indirect blocks
are on levels 1, 2, 3, 4 or 5 for 1x, 2x, 3x, 4x or 5x indirect blocks,
respectively. Inode file data is on level 6 for the inodes and 7-11
for indirect blocks.
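For example, the data blocks of a regular file are on level 0 and its 2x
indirect blocks on level 2, while the corresponding blocks of the inode file
are on levels 6 and 8 respectively.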
Each segment contains objects of a single level only. As a result,
each level requires its own separate segment to be open for writing.
Inode File
----------
All inodes are stored in a special file, the inode file. Single
exception is the inode file's inode (master inode) which for obvious
reasons is stored in the journal instead. Instead of data blocks, the
leaf nodes of the inode files are inodes.
Aliases
-------
Writes in LogFS are done by means of a wandering tree. A naïve
implementation would require that for each write of a block, all
parent blocks are written as well, since the block pointers have
changed. Such an implementation would not be very efficient.
In LogFS, the block pointer changes are cached in the journal by means
of alias entries. Each alias consists of its logical address - inode
number, block index, level and child number (index into block) - and
the changed data. Any 8-byte word can be changed in this manner.
Currently aliases are used for block pointers, file size, file used
bytes and the height of an inode's indirect tree.
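Conceptually an alias carries something like the following (a purely
illustrative struct, not the actual journal entry layout):

	struct alias_example {		/* hypothetical, for illustration only */
		u64 ino;		/* inode number */
		u64 bix;		/* block index */
		u8 level;		/* level of the block holding the word */
		u16 child;		/* index of the 8-byte word in that block */
		u64 val;		/* new value of that word */
	};
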
Segment Aliases
---------------
Related to regular aliases, these are used to handle bad blocks.
Initially, bad blocks are handled by moving the affected segment
content to a spare segment and noting this move in the journal with a
segment alias, a simple (to, from) tuple. GC will later empty this
segment and the alias can be removed again. This is used on MTD only.
Vim
---
By cleverly predicting the life time of data, it is possible to
separate long-living data from short-living data and thereby reduce
the GC overhead later. Each type of distinct life expectancy (vim) can
have a separate segment open for writing. Each (level, vim) tuple can
be open just once. If an open segment with unknown vim is encountered
at mount time, it is closed and ignored henceforth.
Indirect Tree
-------------
Inodes in LogFS are similar to those of FFS-style filesystems, with direct
and indirect block pointers. One difference is that LogFS uses a single
indirect pointer that can be either a 1x, 2x, etc. indirect pointer.
A height field in the inode defines the height of the indirect tree
and thereby the indirection of the pointer.
Another difference is the addressing of indirect blocks. In LogFS,
the first 16 pointers in the first indirect block are left empty,
corresponding to the 16 direct pointers in the inode. In ext2 (maybe
others as well) the first pointer in the first indirect block
corresponds to logical block 12, skipping the 12 direct pointers.
So where ext2 is using arithmetic to better utilize space, LogFS keeps
arithmetic simple and uses compression to save space.
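For example, logical block 16 of a LogFS file (the first block past the 16
direct pointers) is found at slot 16 of the first indirect block, with slots
0-15 left empty, whereas in ext2 logical block 12 (the first block past its
12 direct pointers) is found at slot 0 of the indirect block.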
Compression
-----------
Both file data and metadata can be compressed. Compression for file
data can be enabled with chattr +c and disabled with chattr -c. Doing
so has no effect on existing data, but new data will be stored
accordingly. New inodes will inherit the compression flag of the
parent directory.
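For example (the paths are only illustrative):

	chattr +c /mnt/logfs/data	# new files created here inherit the flag
	chattr -c /mnt/logfs/data/log	# new data for this file stays uncompressed
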
Metadata is always compressed. However, the space accounting ignores
this and charges for the uncompressed size. Failing to do so could
result in GC failures when, after moving some data, indirect blocks
compress worse than previously. Even on a 100% full medium, GC may
not consume any extra space, so the compression gains are lost space
to the user.
However, they are not lost space to the filesystem internals. By
cheating the user for those bytes, the filesystem gained some slack
space and GC will run less often and faster.
Garbage Collection and Wear Leveling
------------------------------------
Garbage collection is invoked whenever the number of free segments
falls below a threshold. The best (known) candidate is picked based
on the least amount of valid data contained in the segment. All
remaining valid data is copied elsewhere, thereby invalidating it.
The GC code also checks for aliases and writes them back if their
number gets too large.
Wear leveling is done by occasionally picking a suboptimal segment for
garbage collection. If a stale segment's erase count is significantly
lower than the active segments' erase counts, it will be picked. Wear
leveling is rate limited, so it will never monopolize the device for
more than one segment worth at a time.
Values for "occasionally" and "significantly lower" are compile time
constants.
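(In this patch those constants are WL_DELTA and WL_RATELIMIT, defined near
the top of fs/logfs/gc.c below.)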
Hashed directories
------------------
To satisfy efficient lookup(), directory entries are hashed and
located based on the hash. In order to both support large directories
and not be overly inefficient for small directories, several hash
tables of increasing size are used. For each table, the hash value
modulo the table size gives the table index.
Table sizes are chosen to limit the number of indirect blocks with a
fully populated table to 0, 1, 2 or 3 respectively. So the first
table contains 16 entries, the second 512-16, etc.
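To illustrate the scheme: a name whose hash is 1000 is first tried at index
1000 mod 16 = 8 in the 16-entry table; if that block is occupied, the next
candidate is index 16 + (1000 mod (512 - 16)) = 24 in the second table, and
so on through the larger tables.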
The last table is special in several ways. First its size depends on
the effective 32bit limit on telldir/seekdir cookies. Since logfs
uses the upper half of the address space for indirect blocks, the size
is limited to 2^31. Secondly the table contains hash buckets with 16
entries each.
Using single-entry buckets would result in birthday "attacks". At
just 2^16 used entries (roughly the square root of the 2^32 hash space),
hash collisions would be likely (P >= 0.5).
My math skills are insufficient to do the combinatorics for the 17x
collisions necessary to overflow a bucket, but testing showed that in
10,000 runs the lowest directory fill before a bucket overflow was
188,057,130 entries with an average of 315,149,915 entries. So for
directory sizes of up to a million, bucket overflows should be
virtually impossible under normal circumstances.
With carefully chosen filenames, it is obviously possible to cause an
overflow with just 21 entries (4 higher tables + 16 entries + 1). So
there may be a security concern if a malicious user has write access
to a directory.
Open For Discussion
===================
Device Address Space
--------------------
A device address space is used for caching. Both block devices and
MTD provide functions to either read a single page or write a segment.
Partial segments may be written for data integrity, but where possible
complete segments are written for performance on simple block device
flash media.
Meta Inodes
-----------
Inodes are stored in the inode file, which is just a regular file for
most purposes. At umount time, however, the inode file needs to
remain open until all dirty inodes are written. So
generic_shutdown_super() may not close this inode, but shouldn't
complain about remaining inodes due to the inode file either. The same
goes for the mapping inode of the device address space.
Currently logfs uses a hack that essentially copies part of fs/inode.c
code over. A general solution would be preferred.
Indirect block mapping
----------------------
With compression, the block device (or mapping inode) cannot be used
to cache indirect blocks. Some other place is required. Currently
logfs uses the top half of each inode's address space. The low 8TB
(on 32bit) are filled with file data, the high 8TB are used for
indirect blocks.
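(With 4096-byte pages, a 32bit page index covers 2^32 * 4096 bytes = 16TB of
address space per inode, hence the split into two 8TB halves.)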
One problem is that 16TB files created on 64bit systems actually have
data in the top 8TB. But files >16TB would cause problems anyway, so
only the limit has changed.

@@ -3450,6 +3450,13 @@ S: Maintained
F: Documentation/ldm.txt
F: fs/partitions/ldm.*
LogFS
M: Joern Engel <joern@logfs.org>
L: logfs@logfs.org
W: logfs.org
S: Maintained
F: fs/logfs/
LSILOGIC MPT FUSION DRIVERS (FC/SAS/SPI)
M: Eric Moore <Eric.Moore@lsi.com>
M: support@lsi.com

@@ -177,6 +177,7 @@ source "fs/efs/Kconfig"
source "fs/jffs2/Kconfig"
# UBIFS File system configuration
source "fs/ubifs/Kconfig"
source "fs/logfs/Kconfig"
source "fs/cramfs/Kconfig"
source "fs/squashfs/Kconfig"
source "fs/freevxfs/Kconfig"

@@ -99,6 +99,7 @@ obj-$(CONFIG_NTFS_FS) += ntfs/
obj-$(CONFIG_UFS_FS) += ufs/
obj-$(CONFIG_EFS_FS) += efs/
obj-$(CONFIG_JFFS2_FS) += jffs2/
obj-$(CONFIG_LOGFS) += logfs/
obj-$(CONFIG_UBIFS_FS) += ubifs/
obj-$(CONFIG_AFFS_FS) += affs/
obj-$(CONFIG_ROMFS_FS) += romfs/

fs/logfs/Kconfig (new file)

@@ -0,0 +1,17 @@
config LOGFS
tristate "LogFS file system (EXPERIMENTAL)"
depends on (MTD || BLOCK) && EXPERIMENTAL
select ZLIB_INFLATE
select ZLIB_DEFLATE
select CRC32
select BTREE
help
Flash filesystem aimed to scale efficiently to large devices.
In comparison to JFFS2 it offers significantly faster mount
times and potentially less RAM usage, although the latter has
not been measured yet.
In its current state it is still very experimental and should
not be used for other than testing purposes.
If unsure, say N.

fs/logfs/Makefile (new file)

@@ -0,0 +1,13 @@
obj-$(CONFIG_LOGFS) += logfs.o
logfs-y += compr.o
logfs-y += dir.o
logfs-y += file.o
logfs-y += gc.o
logfs-y += inode.o
logfs-y += journal.o
logfs-y += readwrite.o
logfs-y += segment.o
logfs-y += super.o
logfs-$(CONFIG_BLOCK) += dev_bdev.o
logfs-$(CONFIG_MTD) += dev_mtd.o

fs/logfs/compr.c (new file)

@@ -0,0 +1,95 @@
/*
* fs/logfs/compr.c - compression routines
*
* As should be obvious for Linux kernel code, license is GPLv2
*
* Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
*/
#include "logfs.h"
#include <linux/vmalloc.h>
#include <linux/zlib.h>
#define COMPR_LEVEL 3
static DEFINE_MUTEX(compr_mutex);
static struct z_stream_s stream;
int logfs_compress(void *in, void *out, size_t inlen, size_t outlen)
{
int err, ret;
ret = -EIO;
mutex_lock(&compr_mutex);
err = zlib_deflateInit(&stream, COMPR_LEVEL);
if (err != Z_OK)
goto error;
stream.next_in = in;
stream.avail_in = inlen;
stream.total_in = 0;
stream.next_out = out;
stream.avail_out = outlen;
stream.total_out = 0;
err = zlib_deflate(&stream, Z_FINISH);
if (err != Z_STREAM_END)
goto error;
err = zlib_deflateEnd(&stream);
if (err != Z_OK)
goto error;
if (stream.total_out >= stream.total_in)
goto error;
ret = stream.total_out;
error:
mutex_unlock(&compr_mutex);
return ret;
}
int logfs_uncompress(void *in, void *out, size_t inlen, size_t outlen)
{
int err, ret;
ret = -EIO;
mutex_lock(&compr_mutex);
err = zlib_inflateInit(&stream);
if (err != Z_OK)
goto error;
stream.next_in = in;
stream.avail_in = inlen;
stream.total_in = 0;
stream.next_out = out;
stream.avail_out = outlen;
stream.total_out = 0;
err = zlib_inflate(&stream, Z_FINISH);
if (err != Z_STREAM_END)
goto error;
err = zlib_inflateEnd(&stream);
if (err != Z_OK)
goto error;
ret = 0;
error:
mutex_unlock(&compr_mutex);
return ret;
}
int __init logfs_compr_init(void)
{
size_t size = max(zlib_deflate_workspacesize(),
zlib_inflate_workspacesize());
stream.workspace = vmalloc(size);
if (!stream.workspace)
return -ENOMEM;
return 0;
}
void logfs_compr_exit(void)
{
vfree(stream.workspace);
}

fs/logfs/dev_bdev.c (new file)

@@ -0,0 +1,327 @@
/*
* fs/logfs/dev_bdev.c - Device access methods for block devices
*
* As should be obvious for Linux kernel code, license is GPLv2
*
* Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
*/
#include "logfs.h"
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/buffer_head.h>
#define PAGE_OFS(ofs) ((ofs) & (PAGE_SIZE-1))
static void request_complete(struct bio *bio, int err)
{
complete((struct completion *)bio->bi_private);
}
static int sync_request(struct page *page, struct block_device *bdev, int rw)
{
struct bio bio;
struct bio_vec bio_vec;
struct completion complete;
bio_init(&bio);
bio.bi_io_vec = &bio_vec;
bio_vec.bv_page = page;
bio_vec.bv_len = PAGE_SIZE;
bio_vec.bv_offset = 0;
bio.bi_vcnt = 1;
bio.bi_idx = 0;
bio.bi_size = PAGE_SIZE;
bio.bi_bdev = bdev;
bio.bi_sector = page->index * (PAGE_SIZE >> 9);
init_completion(&complete);
bio.bi_private = &complete;
bio.bi_end_io = request_complete;
submit_bio(rw, &bio);
generic_unplug_device(bdev_get_queue(bdev));
wait_for_completion(&complete);
return test_bit(BIO_UPTODATE, &bio.bi_flags) ? 0 : -EIO;
}
static int bdev_readpage(void *_sb, struct page *page)
{
struct super_block *sb = _sb;
struct block_device *bdev = logfs_super(sb)->s_bdev;
int err;
err = sync_request(page, bdev, READ);
if (err) {
ClearPageUptodate(page);
SetPageError(page);
} else {
SetPageUptodate(page);
ClearPageError(page);
}
unlock_page(page);
return err;
}
static DECLARE_WAIT_QUEUE_HEAD(wq);
static void writeseg_end_io(struct bio *bio, int err)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
struct super_block *sb = bio->bi_private;
struct logfs_super *super = logfs_super(sb);
struct page *page;
BUG_ON(!uptodate); /* FIXME: Retry io or write elsewhere */
BUG_ON(err);
BUG_ON(bio->bi_vcnt == 0);
do {
page = bvec->bv_page;
if (--bvec >= bio->bi_io_vec)
prefetchw(&bvec->bv_page->flags);
end_page_writeback(page);
} while (bvec >= bio->bi_io_vec);
bio_put(bio);
if (atomic_dec_and_test(&super->s_pending_writes))
wake_up(&wq);
}
static int __bdev_writeseg(struct super_block *sb, u64 ofs, pgoff_t index,
size_t nr_pages)
{
struct logfs_super *super = logfs_super(sb);
struct address_space *mapping = super->s_mapping_inode->i_mapping;
struct bio *bio;
struct page *page;
struct request_queue *q = bdev_get_queue(sb->s_bdev);
unsigned int max_pages = queue_max_hw_sectors(q) >> (PAGE_SHIFT - 9);
int i;
bio = bio_alloc(GFP_NOFS, max_pages);
BUG_ON(!bio); /* FIXME: handle this */
for (i = 0; i < nr_pages; i++) {
if (i >= max_pages) {
/* Block layer cannot split bios :( */
bio->bi_vcnt = i;
bio->bi_idx = 0;
bio->bi_size = i * PAGE_SIZE;
bio->bi_bdev = super->s_bdev;
bio->bi_sector = ofs >> 9;
bio->bi_private = sb;
bio->bi_end_io = writeseg_end_io;
atomic_inc(&super->s_pending_writes);
submit_bio(WRITE, bio);
ofs += i * PAGE_SIZE;
index += i;
nr_pages -= i;
i = 0;
bio = bio_alloc(GFP_NOFS, max_pages);
BUG_ON(!bio);
}
page = find_lock_page(mapping, index + i);
BUG_ON(!page);
bio->bi_io_vec[i].bv_page = page;
bio->bi_io_vec[i].bv_len = PAGE_SIZE;
bio->bi_io_vec[i].bv_offset = 0;
BUG_ON(PageWriteback(page));
set_page_writeback(page);
unlock_page(page);
}
bio->bi_vcnt = nr_pages;
bio->bi_idx = 0;
bio->bi_size = nr_pages * PAGE_SIZE;
bio->bi_bdev = super->s_bdev;
bio->bi_sector = ofs >> 9;
bio->bi_private = sb;
bio->bi_end_io = writeseg_end_io;
atomic_inc(&super->s_pending_writes);
submit_bio(WRITE, bio);
return 0;
}
static void bdev_writeseg(struct super_block *sb, u64 ofs, size_t len)
{
struct logfs_super *super = logfs_super(sb);
int head;
BUG_ON(super->s_flags & LOGFS_SB_FLAG_RO);
if (len == 0) {
/* This can happen when the object fit perfectly into a
* segment, the segment gets written per sync and subsequently
* closed.
*/
return;
}
head = ofs & (PAGE_SIZE - 1);
if (head) {
ofs -= head;
len += head;
}
len = PAGE_ALIGN(len);
__bdev_writeseg(sb, ofs, ofs >> PAGE_SHIFT, len >> PAGE_SHIFT);
generic_unplug_device(bdev_get_queue(logfs_super(sb)->s_bdev));
}
static void erase_end_io(struct bio *bio, int err)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct super_block *sb = bio->bi_private;
struct logfs_super *super = logfs_super(sb);
BUG_ON(!uptodate); /* FIXME: Retry io or write elsewhere */
BUG_ON(err);
BUG_ON(bio->bi_vcnt == 0);
bio_put(bio);
if (atomic_dec_and_test(&super->s_pending_writes))
wake_up(&wq);
}
static int do_erase(struct super_block *sb, u64 ofs, pgoff_t index,
size_t nr_pages)
{
struct logfs_super *super = logfs_super(sb);
struct bio *bio;
struct request_queue *q = bdev_get_queue(sb->s_bdev);
unsigned int max_pages = queue_max_hw_sectors(q) >> (PAGE_SHIFT - 9);
int i;
bio = bio_alloc(GFP_NOFS, max_pages);
BUG_ON(!bio); /* FIXME: handle this */
for (i = 0; i < nr_pages; i++) {
if (i >= max_pages) {
/* Block layer cannot split bios :( */
bio->bi_vcnt = i;
bio->bi_idx = 0;
bio->bi_size = i * PAGE_SIZE;
bio->bi_bdev = super->s_bdev;
bio->bi_sector = ofs >> 9;
bio->bi_private = sb;
bio->bi_end_io = erase_end_io;
atomic_inc(&super->s_pending_writes);
submit_bio(WRITE, bio);
ofs += i * PAGE_SIZE;
index += i;
nr_pages -= i;
i = 0;
bio = bio_alloc(GFP_NOFS, max_pages);
BUG_ON(!bio);
}
bio->bi_io_vec[i].bv_page = super->s_erase_page;
bio->bi_io_vec[i].bv_len = PAGE_SIZE;
bio->bi_io_vec[i].bv_offset = 0;
}
bio->bi_vcnt = nr_pages;
bio->bi_idx = 0;
bio->bi_size = nr_pages * PAGE_SIZE;
bio->bi_bdev = super->s_bdev;
bio->bi_sector = ofs >> 9;
bio->bi_private = sb;
bio->bi_end_io = erase_end_io;
atomic_inc(&super->s_pending_writes);
submit_bio(WRITE, bio);
return 0;
}
static int bdev_erase(struct super_block *sb, loff_t to, size_t len,
int ensure_write)
{
struct logfs_super *super = logfs_super(sb);
BUG_ON(to & (PAGE_SIZE - 1));
BUG_ON(len & (PAGE_SIZE - 1));
if (super->s_flags & LOGFS_SB_FLAG_RO)
return -EROFS;
if (ensure_write) {
/*
* Object store doesn't care whether erases happen or not.
* But for the journal they are required. Otherwise a scan
* can find an old commit entry and assume it is the current
* one, travelling back in time.
*/
do_erase(sb, to, to >> PAGE_SHIFT, len >> PAGE_SHIFT);
}
return 0;
}
static void bdev_sync(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
wait_event(wq, atomic_read(&super->s_pending_writes) == 0);
}
static struct page *bdev_find_first_sb(struct super_block *sb, u64 *ofs)
{
struct logfs_super *super = logfs_super(sb);
struct address_space *mapping = super->s_mapping_inode->i_mapping;
filler_t *filler = bdev_readpage;
*ofs = 0;
return read_cache_page(mapping, 0, filler, sb);
}
static struct page *bdev_find_last_sb(struct super_block *sb, u64 *ofs)
{
struct logfs_super *super = logfs_super(sb);
struct address_space *mapping = super->s_mapping_inode->i_mapping;
filler_t *filler = bdev_readpage;
u64 pos = (super->s_bdev->bd_inode->i_size & ~0xfffULL) - 0x1000;
pgoff_t index = pos >> PAGE_SHIFT;
*ofs = pos;
return read_cache_page(mapping, index, filler, sb);
}
static int bdev_write_sb(struct super_block *sb, struct page *page)
{
struct block_device *bdev = logfs_super(sb)->s_bdev;
/* Nothing special to do for block devices. */
return sync_request(page, bdev, WRITE);
}
static void bdev_put_device(struct super_block *sb)
{
close_bdev_exclusive(logfs_super(sb)->s_bdev, FMODE_READ|FMODE_WRITE);
}
static const struct logfs_device_ops bd_devops = {
.find_first_sb = bdev_find_first_sb,
.find_last_sb = bdev_find_last_sb,
.write_sb = bdev_write_sb,
.readpage = bdev_readpage,
.writeseg = bdev_writeseg,
.erase = bdev_erase,
.sync = bdev_sync,
.put_device = bdev_put_device,
};
int logfs_get_sb_bdev(struct file_system_type *type, int flags,
const char *devname, struct vfsmount *mnt)
{
struct block_device *bdev;
bdev = open_bdev_exclusive(devname, FMODE_READ|FMODE_WRITE, type);
if (IS_ERR(bdev))
return PTR_ERR(bdev);
if (MAJOR(bdev->bd_dev) == MTD_BLOCK_MAJOR) {
int mtdnr = MINOR(bdev->bd_dev);
close_bdev_exclusive(bdev, FMODE_READ|FMODE_WRITE);
return logfs_get_sb_mtd(type, flags, mtdnr, mnt);
}
return logfs_get_sb_device(type, flags, NULL, bdev, &bd_devops, mnt);
}

fs/logfs/dev_mtd.c (new file)

@@ -0,0 +1,254 @@
/*
* fs/logfs/dev_mtd.c - Device access methods for MTD
*
* As should be obvious for Linux kernel code, license is GPLv2
*
* Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
*/
#include "logfs.h"
#include <linux/completion.h>
#include <linux/mount.h>
#include <linux/sched.h>
#define PAGE_OFS(ofs) ((ofs) & (PAGE_SIZE-1))
static int mtd_read(struct super_block *sb, loff_t ofs, size_t len, void *buf)
{
struct mtd_info *mtd = logfs_super(sb)->s_mtd;
size_t retlen;
int ret;
ret = mtd->read(mtd, ofs, len, &retlen, buf);
BUG_ON(ret == -EINVAL);
if (ret)
return ret;
/* Not sure if we should loop instead. */
if (retlen != len)
return -EIO;
return 0;
}
static int mtd_write(struct super_block *sb, loff_t ofs, size_t len, void *buf)
{
struct logfs_super *super = logfs_super(sb);
struct mtd_info *mtd = super->s_mtd;
size_t retlen;
loff_t page_start, page_end;
int ret;
if (super->s_flags & LOGFS_SB_FLAG_RO)
return -EROFS;
BUG_ON((ofs >= mtd->size) || (len > mtd->size - ofs));
BUG_ON(ofs != (ofs >> super->s_writeshift) << super->s_writeshift);
BUG_ON(len > PAGE_CACHE_SIZE);
page_start = ofs & PAGE_CACHE_MASK;
page_end = PAGE_CACHE_ALIGN(ofs + len) - 1;
ret = mtd->write(mtd, ofs, len, &retlen, buf);
if (ret || (retlen != len))
return -EIO;
return 0;
}
/*
* For as long as I can remember (since about 2001) mtd->erase has been an
* asynchronous interface lacking the first driver to actually use the
* asynchronous properties. So just to prevent the first implementor of such
* a thing from breaking logfs in 2350, we do the usual pointless dance to
* declare a completion variable and wait for completion before returning
* from mtd_erase(). What an exercise in futility!
*/
static void logfs_erase_callback(struct erase_info *ei)
{
complete((struct completion *)ei->priv);
}
static int mtd_erase_mapping(struct super_block *sb, loff_t ofs, size_t len)
{
struct logfs_super *super = logfs_super(sb);
struct address_space *mapping = super->s_mapping_inode->i_mapping;
struct page *page;
pgoff_t index = ofs >> PAGE_SHIFT;
for (index = ofs >> PAGE_SHIFT; index < (ofs + len) >> PAGE_SHIFT; index++) {
page = find_get_page(mapping, index);
if (!page)
continue;
memset(page_address(page), 0xFF, PAGE_SIZE);
page_cache_release(page);
}
return 0;
}
static int mtd_erase(struct super_block *sb, loff_t ofs, size_t len,
int ensure_write)
{
struct mtd_info *mtd = logfs_super(sb)->s_mtd;
struct erase_info ei;
DECLARE_COMPLETION_ONSTACK(complete);
int ret;
BUG_ON(len % mtd->erasesize);
if (logfs_super(sb)->s_flags & LOGFS_SB_FLAG_RO)
return -EROFS;
memset(&ei, 0, sizeof(ei));
ei.mtd = mtd;
ei.addr = ofs;
ei.len = len;
ei.callback = logfs_erase_callback;
ei.priv = (long)&complete;
ret = mtd->erase(mtd, &ei);
if (ret)
return -EIO;
wait_for_completion(&complete);
if (ei.state != MTD_ERASE_DONE)
return -EIO;
return mtd_erase_mapping(sb, ofs, len);
}
static void mtd_sync(struct super_block *sb)
{
struct mtd_info *mtd = logfs_super(sb)->s_mtd;
if (mtd->sync)
mtd->sync(mtd);
}
static int mtd_readpage(void *_sb, struct page *page)
{
struct super_block *sb = _sb;
int err;
err = mtd_read(sb, page->index << PAGE_SHIFT, PAGE_SIZE,
page_address(page));
if (err == -EUCLEAN) {
err = 0;
/* FIXME: force GC this segment */
}
if (err) {
ClearPageUptodate(page);
SetPageError(page);
} else {
SetPageUptodate(page);
ClearPageError(page);
}
unlock_page(page);
return err;
}
static struct page *mtd_find_first_sb(struct super_block *sb, u64 *ofs)
{
struct logfs_super *super = logfs_super(sb);
struct address_space *mapping = super->s_mapping_inode->i_mapping;
filler_t *filler = mtd_readpage;
struct mtd_info *mtd = super->s_mtd;
if (!mtd->block_isbad)
return NULL;
*ofs = 0;
while (mtd->block_isbad(mtd, *ofs)) {
*ofs += mtd->erasesize;
if (*ofs >= mtd->size)
return NULL;
}
BUG_ON(*ofs & ~PAGE_MASK);
return read_cache_page(mapping, *ofs >> PAGE_SHIFT, filler, sb);
}
static struct page *mtd_find_last_sb(struct super_block *sb, u64 *ofs)
{
struct logfs_super *super = logfs_super(sb);
struct address_space *mapping = super->s_mapping_inode->i_mapping;
filler_t *filler = mtd_readpage;
struct mtd_info *mtd = super->s_mtd;
if (!mtd->block_isbad)
return NULL;
*ofs = mtd->size - mtd->erasesize;
while (mtd->block_isbad(mtd, *ofs)) {
*ofs -= mtd->erasesize;
if (*ofs <= 0)
return NULL;
}
*ofs = *ofs + mtd->erasesize - 0x1000;
BUG_ON(*ofs & ~PAGE_MASK);
return read_cache_page(mapping, *ofs >> PAGE_SHIFT, filler, sb);
}
static int __mtd_writeseg(struct super_block *sb, u64 ofs, pgoff_t index,
size_t nr_pages)
{
struct logfs_super *super = logfs_super(sb);
struct address_space *mapping = super->s_mapping_inode->i_mapping;
struct page *page;
int i, err;
for (i = 0; i < nr_pages; i++) {
page = find_lock_page(mapping, index + i);
BUG_ON(!page);
err = mtd_write(sb, page->index << PAGE_SHIFT, PAGE_SIZE,
page_address(page));
unlock_page(page);
page_cache_release(page);
if (err)
return err;
}
return 0;
}
static void mtd_writeseg(struct super_block *sb, u64 ofs, size_t len)
{
struct logfs_super *super = logfs_super(sb);
int head;
if (super->s_flags & LOGFS_SB_FLAG_RO)
return;
if (len == 0) {
/* This can happen when the object fit perfectly into a
* segment, the segment gets written per sync and subsequently
* closed.
*/
return;
}
head = ofs & (PAGE_SIZE - 1);
if (head) {
ofs -= head;
len += head;
}
len = PAGE_ALIGN(len);
__mtd_writeseg(sb, ofs, ofs >> PAGE_SHIFT, len >> PAGE_SHIFT);
}
static void mtd_put_device(struct super_block *sb)
{
put_mtd_device(logfs_super(sb)->s_mtd);
}
static const struct logfs_device_ops mtd_devops = {
.find_first_sb = mtd_find_first_sb,
.find_last_sb = mtd_find_last_sb,
.readpage = mtd_readpage,
.writeseg = mtd_writeseg,
.erase = mtd_erase,
.sync = mtd_sync,
.put_device = mtd_put_device,
};
int logfs_get_sb_mtd(struct file_system_type *type, int flags,
int mtdnr, struct vfsmount *mnt)
{
struct mtd_info *mtd;
const struct logfs_device_ops *devops = &mtd_devops;
mtd = get_mtd_device(NULL, mtdnr);
return logfs_get_sb_device(type, flags, mtd, NULL, devops, mnt);
}

fs/logfs/dir.c (new file)

@@ -0,0 +1,827 @@
/*
* fs/logfs/dir.c - directory-related code
*
* As should be obvious for Linux kernel code, license is GPLv2
*
* Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
*/
#include "logfs.h"
/*
* Atomic dir operations
*
* Directory operations are by default not atomic. Dentries and Inodes are
* created/removed/altered in separate operations. Therefore we need to do
* a small amount of journaling.
*
* Create, link, mkdir, mknod and symlink all share the same function to do
* the work: __logfs_create. This function works in two atomic steps:
* 1. allocate inode (remember in journal)
* 2. allocate dentry (clear journal)
*
* As we can only get interrupted between the two, the inode we just
* created is simply stored in the anchor. On next mount, if we were
* interrupted, we delete the inode. From a user's point of view the
* operation never happened.
*
* Unlink and rmdir also share the same function: unlink. Again, this
* function works in two atomic steps
* 1. remove dentry (remember inode in journal)
* 2. unlink inode (clear journal)
*
* And again, on the next mount, if we were interrupted, we delete the inode.
* From a user's point of view the operation succeeded.
*
* Rename is the real pain to deal with, harder than all the other methods
* combined. Depending on the circumstances we can run into three cases.
* A "target rename" where the target dentry already existed, a "local
* rename" where both parent directories are identical or a "cross-directory
* rename" in the remaining case.
*
* Local rename is atomic, as the old dentry is simply rewritten with a new
* name.
*
* Cross-directory rename works in two steps, similar to __logfs_create and
* logfs_unlink:
* 1. Write new dentry (remember old dentry in journal)
* 2. Remove old dentry (clear journal)
*
* Here we remember a dentry instead of an inode. On next mount, if we were
* interrupted, we delete the dentry. From a user's point of view, the
* operation succeeded.
*
* Target rename works in three atomic steps:
* 1. Attach old inode to new dentry (remember old dentry and new inode)
* 2. Remove old dentry (still remember the new inode)
* 3. Remove victim inode
*
* Here we remember both an inode and a dentry. If we get interrupted
* between steps 1 and 2, we delete both the dentry and the inode. If
* we get interrupted between steps 2 and 3, we delete just the inode.
* In either case, the remaining objects are deleted on next mount. From
* a user's point of view, the operation succeeded.
*/
static int write_dir(struct inode *dir, struct logfs_disk_dentry *dd,
loff_t pos)
{
return logfs_inode_write(dir, dd, sizeof(*dd), pos, WF_LOCK, NULL);
}
static int write_inode(struct inode *inode)
{
return __logfs_write_inode(inode, WF_LOCK);
}
static s64 dir_seek_data(struct inode *inode, s64 pos)
{
s64 new_pos = logfs_seek_data(inode, pos);
return max(pos, new_pos - 1);
}
static int beyond_eof(struct inode *inode, loff_t bix)
{
loff_t pos = bix << inode->i_sb->s_blocksize_bits;
return pos >= i_size_read(inode);
}
/*
* Prime value was chosen to be roughly 256 + 26. r5 hash uses 11,
* so short names (len <= 9) don't even occupy the complete 32bit name
* space. A prime >256 ensures short names quickly spread the 32bit
* name space. Add about 26 for the estimated amount of information
* of each character and pick a prime nearby, preferably a bit-sparse
* one.
*/
static u32 hash_32(const char *s, int len, u32 seed)
{
u32 hash = seed;
int i;
for (i = 0; i < len; i++)
hash = hash * 293 + s[i];
return hash;
}
/*
* We have to satisfy several conflicting requirements here. Small
* directories should stay fairly compact and not require too many
* indirect blocks. The number of possible locations for a given hash
* should be small to make lookup() fast. And we should try hard not
* to overflow the 32bit name space or nfs and 32bit host systems will
* be unhappy.
*
* So we use the following scheme. First we reduce the hash to 0..15
* and try a direct block. If that is occupied we reduce the hash to
* 16..255 and try an indirect block. Same for 2x and 3x indirect
* blocks. Lastly we reduce the hash to 0x800_0000 .. 0xffff_ffff,
* but use buckets containing 16 entries instead of a single one.
*
* Using 16 entries should allow for a reasonable amount of hash
* collisions, so the 32bit name space can be packed fairly tight
* before overflowing. Oh and currently we don't overflow but return
* an error.
*
* How likely are collisions? Doing the appropriate math is beyond me
* and the Bronstein textbook. But running a test program to brute
* force collisions for a couple of days showed that on average the
* first collision occurs after 598M entries, with 290M being the
* smallest result. Obviously 21 entries could already cause a
* collision if all entries are carefully chosen.
*/
static pgoff_t hash_index(u32 hash, int round)
{
u32 i0_blocks = I0_BLOCKS;
u32 i1_blocks = I1_BLOCKS;
u32 i2_blocks = I2_BLOCKS;
u32 i3_blocks = I3_BLOCKS;
switch (round) {
case 0:
return hash % i0_blocks;
case 1:
return i0_blocks + hash % (i1_blocks - i0_blocks);
case 2:
return i1_blocks + hash % (i2_blocks - i1_blocks);
case 3:
return i2_blocks + hash % (i3_blocks - i2_blocks);
case 4 ... 19:
return i3_blocks + 16 * (hash % (((1<<31) - i3_blocks) / 16))
+ round - 4;
}
BUG();
}
static struct page *logfs_get_dd_page(struct inode *dir, struct dentry *dentry)
{
struct qstr *name = &dentry->d_name;
struct page *page;
struct logfs_disk_dentry *dd;
u32 hash = hash_32(name->name, name->len, 0);
pgoff_t index;
int round;
if (name->len > LOGFS_MAX_NAMELEN)
return ERR_PTR(-ENAMETOOLONG);
for (round = 0; round < 20; round++) {
index = hash_index(hash, round);
if (beyond_eof(dir, index))
return NULL;
if (!logfs_exist_block(dir, index))
continue;
page = read_cache_page(dir->i_mapping, index,
(filler_t *)logfs_readpage, NULL);
if (IS_ERR(page))
return page;
dd = kmap_atomic(page, KM_USER0);
BUG_ON(dd->namelen == 0);
if (name->len != be16_to_cpu(dd->namelen) ||
memcmp(name->name, dd->name, name->len)) {
kunmap_atomic(dd, KM_USER0);
page_cache_release(page);
continue;
}
kunmap_atomic(dd, KM_USER0);
return page;
}
return NULL;
}
static int logfs_remove_inode(struct inode *inode)
{
int ret;
inode->i_nlink--;
ret = write_inode(inode);
LOGFS_BUG_ON(ret, inode->i_sb);
return ret;
}
static void abort_transaction(struct inode *inode, struct logfs_transaction *ta)
{
if (logfs_inode(inode)->li_block)
logfs_inode(inode)->li_block->ta = NULL;
kfree(ta);
}
static int logfs_unlink(struct inode *dir, struct dentry *dentry)
{
struct logfs_super *super = logfs_super(dir->i_sb);
struct inode *inode = dentry->d_inode;
struct logfs_transaction *ta;
struct page *page;
pgoff_t index;
int ret;
ta = kzalloc(sizeof(*ta), GFP_KERNEL);
if (!ta)
return -ENOMEM;
ta->state = UNLINK_1;
ta->ino = inode->i_ino;
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
page = logfs_get_dd_page(dir, dentry);
if (!page) {
kfree(ta);
return -ENOENT;
}
if (IS_ERR(page)) {
kfree(ta);
return PTR_ERR(page);
}
index = page->index;
page_cache_release(page);
mutex_lock(&super->s_dirop_mutex);
logfs_add_transaction(dir, ta);
ret = logfs_delete(dir, index, NULL);
if (!ret)
ret = write_inode(dir);
if (ret) {
abort_transaction(dir, ta);
printk(KERN_ERR"LOGFS: unable to delete inode\n");
goto out;
}
ta->state = UNLINK_2;
logfs_add_transaction(inode, ta);
ret = logfs_remove_inode(inode);
out:
mutex_unlock(&super->s_dirop_mutex);
return ret;
}
static inline int logfs_empty_dir(struct inode *dir)
{
u64 data;
data = logfs_seek_data(dir, 0) << dir->i_sb->s_blocksize_bits;
return data >= i_size_read(dir);
}
static int logfs_rmdir(struct inode *dir, struct dentry *dentry)
{
struct inode *inode = dentry->d_inode;
if (!logfs_empty_dir(inode))
return -ENOTEMPTY;
return logfs_unlink(dir, dentry);
}
/* FIXME: readdir currently has its own dir_walk code. I don't see a good
* way to combine the two copies */
#define IMPLICIT_NODES 2
static int __logfs_readdir(struct file *file, void *buf, filldir_t filldir)
{
struct inode *dir = file->f_dentry->d_inode;
loff_t pos = file->f_pos - IMPLICIT_NODES;
struct page *page;
struct logfs_disk_dentry *dd;
int full;
BUG_ON(pos < 0);
for (;; pos++) {
if (beyond_eof(dir, pos))
break;
if (!logfs_exist_block(dir, pos)) {
/* deleted dentry */
pos = dir_seek_data(dir, pos);
continue;
}
page = read_cache_page(dir->i_mapping, pos,
(filler_t *)logfs_readpage, NULL);
if (IS_ERR(page))
return PTR_ERR(page);
dd = kmap_atomic(page, KM_USER0);
BUG_ON(dd->namelen == 0);
full = filldir(buf, (char *)dd->name, be16_to_cpu(dd->namelen),
pos, be64_to_cpu(dd->ino), dd->type);
kunmap_atomic(dd, KM_USER0);
page_cache_release(page);
if (full)
break;
}
file->f_pos = pos + IMPLICIT_NODES;
return 0;
}
static int logfs_readdir(struct file *file, void *buf, filldir_t filldir)
{
struct inode *inode = file->f_dentry->d_inode;
ino_t pino = parent_ino(file->f_dentry);
int err;
if (file->f_pos < 0)
return -EINVAL;
if (file->f_pos == 0) {
if (filldir(buf, ".", 1, 1, inode->i_ino, DT_DIR) < 0)
return 0;
file->f_pos++;
}
if (file->f_pos == 1) {
if (filldir(buf, "..", 2, 2, pino, DT_DIR) < 0)
return 0;
file->f_pos++;
}
err = __logfs_readdir(file, buf, filldir);
return err;
}
static void logfs_set_name(struct logfs_disk_dentry *dd, struct qstr *name)
{
dd->namelen = cpu_to_be16(name->len);
memcpy(dd->name, name->name, name->len);
}
static struct dentry *logfs_lookup(struct inode *dir, struct dentry *dentry,
struct nameidata *nd)
{
struct page *page;
struct logfs_disk_dentry *dd;
pgoff_t index;
u64 ino = 0;
struct inode *inode;
page = logfs_get_dd_page(dir, dentry);
if (IS_ERR(page))
return ERR_CAST(page);
if (!page) {
d_add(dentry, NULL);
return NULL;
}
index = page->index;
dd = kmap_atomic(page, KM_USER0);
ino = be64_to_cpu(dd->ino);
kunmap_atomic(dd, KM_USER0);
page_cache_release(page);
inode = logfs_iget(dir->i_sb, ino);
if (IS_ERR(inode)) {
printk(KERN_ERR"LogFS: Cannot read inode #%llx for dentry (%lx, %lx)n",
ino, dir->i_ino, index);
return ERR_CAST(inode);
}
return d_splice_alias(inode, dentry);
}
static void grow_dir(struct inode *dir, loff_t index)
{
index = (index + 1) << dir->i_sb->s_blocksize_bits;
if (i_size_read(dir) < index)
i_size_write(dir, index);
}
static int logfs_write_dir(struct inode *dir, struct dentry *dentry,
struct inode *inode)
{
struct page *page;
struct logfs_disk_dentry *dd;
u32 hash = hash_32(dentry->d_name.name, dentry->d_name.len, 0);
pgoff_t index;
int round, err;
for (round = 0; round < 20; round++) {
index = hash_index(hash, round);
if (logfs_exist_block(dir, index))
continue;
page = find_or_create_page(dir->i_mapping, index, GFP_KERNEL);
if (!page)
return -ENOMEM;
dd = kmap_atomic(page, KM_USER0);
memset(dd, 0, sizeof(*dd));
dd->ino = cpu_to_be64(inode->i_ino);
dd->type = logfs_type(inode);
logfs_set_name(dd, &dentry->d_name);
kunmap_atomic(dd, KM_USER0);
err = logfs_write_buf(dir, page, WF_LOCK);
unlock_page(page);
page_cache_release(page);
if (!err)
grow_dir(dir, index);
return err;
}
/* FIXME: Is there a better return value? In most cases neither
* the filesystem nor the directory are full. But we have had
* too many collisions for this particular hash and no fallback.
*/
return -ENOSPC;
}
static int __logfs_create(struct inode *dir, struct dentry *dentry,
struct inode *inode, const char *dest, long destlen)
{
struct logfs_super *super = logfs_super(dir->i_sb);
struct logfs_inode *li = logfs_inode(inode);
struct logfs_transaction *ta;
int ret;
ta = kzalloc(sizeof(*ta), GFP_KERNEL);
if (!ta)
return -ENOMEM;
ta->state = CREATE_1;
ta->ino = inode->i_ino;
mutex_lock(&super->s_dirop_mutex);
logfs_add_transaction(inode, ta);
if (dest) {
/* symlink */
ret = logfs_inode_write(inode, dest, destlen, 0, WF_LOCK, NULL);
if (!ret)
ret = write_inode(inode);
} else {
/* creat/mkdir/mknod */
ret = write_inode(inode);
}
if (ret) {
abort_transaction(inode, ta);
li->li_flags |= LOGFS_IF_STILLBORN;
/* FIXME: truncate symlink */
inode->i_nlink--;
iput(inode);
goto out;
}
ta->state = CREATE_2;
logfs_add_transaction(dir, ta);
ret = logfs_write_dir(dir, dentry, inode);
/* sync directory */
if (!ret)
ret = write_inode(dir);
if (ret) {
logfs_del_transaction(dir, ta);
ta->state = CREATE_2;
logfs_add_transaction(inode, ta);
logfs_remove_inode(inode);
iput(inode);
goto out;
}
d_instantiate(dentry, inode);
out:
mutex_unlock(&super->s_dirop_mutex);
return ret;
}
static int logfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
{
struct inode *inode;
/*
* FIXME: why do we have to fill in S_IFDIR, while the mode is
* correct for mknod, creat, etc.? Smells like the vfs *should*
* do it for us but for some reason fails to do so.
*/
inode = logfs_new_inode(dir, S_IFDIR | mode);
if (IS_ERR(inode))
return PTR_ERR(inode);
inode->i_op = &logfs_dir_iops;
inode->i_fop = &logfs_dir_fops;
return __logfs_create(dir, dentry, inode, NULL, 0);
}
static int logfs_create(struct inode *dir, struct dentry *dentry, int mode,
struct nameidata *nd)
{
struct inode *inode;
inode = logfs_new_inode(dir, mode);
if (IS_ERR(inode))
return PTR_ERR(inode);
inode->i_op = &logfs_reg_iops;
inode->i_fop = &logfs_reg_fops;
inode->i_mapping->a_ops = &logfs_reg_aops;
return __logfs_create(dir, dentry, inode, NULL, 0);
}
static int logfs_mknod(struct inode *dir, struct dentry *dentry, int mode,
dev_t rdev)
{
struct inode *inode;
if (dentry->d_name.len > LOGFS_MAX_NAMELEN)
return -ENAMETOOLONG;
inode = logfs_new_inode(dir, mode);
if (IS_ERR(inode))
return PTR_ERR(inode);
init_special_inode(inode, mode, rdev);
return __logfs_create(dir, dentry, inode, NULL, 0);
}
static int logfs_symlink(struct inode *dir, struct dentry *dentry,
const char *target)
{
struct inode *inode;
size_t destlen = strlen(target) + 1;
if (destlen > dir->i_sb->s_blocksize)
return -ENAMETOOLONG;
inode = logfs_new_inode(dir, S_IFLNK | 0777);
if (IS_ERR(inode))
return PTR_ERR(inode);
inode->i_op = &logfs_symlink_iops;
inode->i_mapping->a_ops = &logfs_reg_aops;
return __logfs_create(dir, dentry, inode, target, destlen);
}
static int logfs_permission(struct inode *inode, int mask)
{
return generic_permission(inode, mask, NULL);
}
static int logfs_link(struct dentry *old_dentry, struct inode *dir,
struct dentry *dentry)
{
struct inode *inode = old_dentry->d_inode;
if (inode->i_nlink >= LOGFS_LINK_MAX)
return -EMLINK;
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
atomic_inc(&inode->i_count);
inode->i_nlink++;
mark_inode_dirty_sync(inode);
return __logfs_create(dir, dentry, inode, NULL, 0);
}
static int logfs_get_dd(struct inode *dir, struct dentry *dentry,
struct logfs_disk_dentry *dd, loff_t *pos)
{
struct page *page;
void *map;
page = logfs_get_dd_page(dir, dentry);
if (IS_ERR(page))
return PTR_ERR(page);
*pos = page->index;
map = kmap_atomic(page, KM_USER0);
memcpy(dd, map, sizeof(*dd));
kunmap_atomic(map, KM_USER0);
page_cache_release(page);
return 0;
}
static int logfs_delete_dd(struct inode *dir, loff_t pos)
{
/*
* Getting called with pos somewhere beyond eof is either a goofup
* within this file or means someone maliciously edited the
* (crc-protected) journal.
*/
BUG_ON(beyond_eof(dir, pos));
dir->i_ctime = dir->i_mtime = CURRENT_TIME;
log_dir(" Delete dentry (%lx, %llx)\n", dir->i_ino, pos);
return logfs_delete(dir, pos, NULL);
}
/*
* Cross-directory rename, target does not exist. Just a little nasty.
* Create a new dentry in the target dir, then remove the old dentry,
* all the while taking care to remember our operation in the journal.
*/
static int logfs_rename_cross(struct inode *old_dir, struct dentry *old_dentry,
struct inode *new_dir, struct dentry *new_dentry)
{
struct logfs_super *super = logfs_super(old_dir->i_sb);
struct logfs_disk_dentry dd;
struct logfs_transaction *ta;
loff_t pos;
int err;
/* 1. locate source dd */
err = logfs_get_dd(old_dir, old_dentry, &dd, &pos);
if (err)
return err;
ta = kzalloc(sizeof(*ta), GFP_KERNEL);
if (!ta)
return -ENOMEM;
ta->state = CROSS_RENAME_1;
ta->dir = old_dir->i_ino;
ta->pos = pos;
/* 2. write target dd */
mutex_lock(&super->s_dirop_mutex);
logfs_add_transaction(new_dir, ta);
err = logfs_write_dir(new_dir, new_dentry, old_dentry->d_inode);
if (!err)
err = write_inode(new_dir);
if (err) {
super->s_rename_dir = 0;
super->s_rename_pos = 0;
abort_transaction(new_dir, ta);
goto out;
}
/* 3. remove source dd */
ta->state = CROSS_RENAME_2;
logfs_add_transaction(old_dir, ta);
err = logfs_delete_dd(old_dir, pos);
if (!err)
err = write_inode(old_dir);
LOGFS_BUG_ON(err, old_dir->i_sb);
out:
mutex_unlock(&super->s_dirop_mutex);
return err;
}
static int logfs_replace_inode(struct inode *dir, struct dentry *dentry,
struct logfs_disk_dentry *dd, struct inode *inode)
{
loff_t pos;
int err;
err = logfs_get_dd(dir, dentry, dd, &pos);
if (err)
return err;
dd->ino = cpu_to_be64(inode->i_ino);
dd->type = logfs_type(inode);
err = write_dir(dir, dd, pos);
if (err)
return err;
log_dir("Replace dentry (%lx, %llx) %s -> %llx\n", dir->i_ino, pos,
dd->name, be64_to_cpu(dd->ino));
return write_inode(dir);
}
/* Target dentry exists - the worst case. We need to attach the source
* inode to the target dentry, then remove the orphaned target inode and
* source dentry.
*/
static int logfs_rename_target(struct inode *old_dir, struct dentry *old_dentry,
struct inode *new_dir, struct dentry *new_dentry)
{
struct logfs_super *super = logfs_super(old_dir->i_sb);
struct inode *old_inode = old_dentry->d_inode;
struct inode *new_inode = new_dentry->d_inode;
int isdir = S_ISDIR(old_inode->i_mode);
struct logfs_disk_dentry dd;
struct logfs_transaction *ta;
loff_t pos;
int err;
BUG_ON(isdir != S_ISDIR(new_inode->i_mode));
if (isdir) {
if (!logfs_empty_dir(new_inode))
return -ENOTEMPTY;
}
/* 1. locate source dd */
err = logfs_get_dd(old_dir, old_dentry, &dd, &pos);
if (err)
return err;
ta = kzalloc(sizeof(*ta), GFP_KERNEL);
if (!ta)
return -ENOMEM;
ta->state = TARGET_RENAME_1;
ta->dir = old_dir->i_ino;
ta->pos = pos;
ta->ino = new_inode->i_ino;
/* 2. attach source inode to target dd */
mutex_lock(&super->s_dirop_mutex);
logfs_add_transaction(new_dir, ta);
err = logfs_replace_inode(new_dir, new_dentry, &dd, old_inode);
if (err) {
super->s_rename_dir = 0;
super->s_rename_pos = 0;
super->s_victim_ino = 0;
abort_transaction(new_dir, ta);
goto out;
}
/* 3. remove source dd */
ta->state = TARGET_RENAME_2;
logfs_add_transaction(old_dir, ta);
err = logfs_delete_dd(old_dir, pos);
if (!err)
err = write_inode(old_dir);
LOGFS_BUG_ON(err, old_dir->i_sb);
/* 4. remove target inode */
ta->state = TARGET_RENAME_3;
logfs_add_transaction(new_inode, ta);
err = logfs_remove_inode(new_inode);
out:
mutex_unlock(&super->s_dirop_mutex);
return err;
}
static int logfs_rename(struct inode *old_dir, struct dentry *old_dentry,
struct inode *new_dir, struct dentry *new_dentry)
{
if (new_dentry->d_inode)
return logfs_rename_target(old_dir, old_dentry,
new_dir, new_dentry);
return logfs_rename_cross(old_dir, old_dentry, new_dir, new_dentry);
}
/* No locking done here, as this is called before .get_sb() returns. */
int logfs_replay_journal(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
struct inode *inode;
u64 ino, pos;
int err;
if (super->s_victim_ino) {
/* delete victim inode */
ino = super->s_victim_ino;
printk(KERN_INFO"LogFS: delete unmapped inode #%llx\n", ino);
inode = logfs_iget(sb, ino);
if (IS_ERR(inode))
goto fail;
LOGFS_BUG_ON(i_size_read(inode) > 0, sb);
super->s_victim_ino = 0;
err = logfs_remove_inode(inode);
iput(inode);
if (err) {
super->s_victim_ino = ino;
goto fail;
}
}
if (super->s_rename_dir) {
/* delete old dd from rename */
ino = super->s_rename_dir;
pos = super->s_rename_pos;
printk(KERN_INFO"LogFS: delete unbacked dentry (%llx, %llx)\n",
ino, pos);
inode = logfs_iget(sb, ino);
if (IS_ERR(inode))
goto fail;
super->s_rename_dir = 0;
super->s_rename_pos = 0;
err = logfs_delete_dd(inode, pos);
iput(inode);
if (err) {
super->s_rename_dir = ino;
super->s_rename_pos = pos;
goto fail;
}
}
return 0;
fail:
LOGFS_BUG(sb);
return -EIO;
}
const struct inode_operations logfs_symlink_iops = {
.readlink = generic_readlink,
.follow_link = page_follow_link_light,
};
const struct inode_operations logfs_dir_iops = {
.create = logfs_create,
.link = logfs_link,
.lookup = logfs_lookup,
.mkdir = logfs_mkdir,
.mknod = logfs_mknod,
.rename = logfs_rename,
.rmdir = logfs_rmdir,
.permission = logfs_permission,
.symlink = logfs_symlink,
.unlink = logfs_unlink,
};
const struct file_operations logfs_dir_fops = {
.fsync = logfs_fsync,
.ioctl = logfs_ioctl,
.readdir = logfs_readdir,
.read = generic_read_dir,
};

fs/logfs/file.c (new file)

@@ -0,0 +1,263 @@
/*
* fs/logfs/file.c - prepare_write, commit_write and friends
*
* As should be obvious for Linux kernel code, license is GPLv2
*
* Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
*/
#include "logfs.h"
#include <linux/sched.h>
#include <linux/writeback.h>
static int logfs_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata)
{
struct inode *inode = mapping->host;
struct page *page;
pgoff_t index = pos >> PAGE_CACHE_SHIFT;
page = grab_cache_page_write_begin(mapping, index, flags);
if (!page)
return -ENOMEM;
*pagep = page;
if ((len == PAGE_CACHE_SIZE) || PageUptodate(page))
return 0;
if ((pos & PAGE_CACHE_MASK) >= i_size_read(inode)) {
unsigned start = pos & (PAGE_CACHE_SIZE - 1);
unsigned end = start + len;
/* Reading beyond i_size is simple: memset to zero */
zero_user_segments(page, 0, start, end, PAGE_CACHE_SIZE);
return 0;
}
return logfs_readpage_nolock(page);
}
static int logfs_write_end(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied, struct page *page,
void *fsdata)
{
struct inode *inode = mapping->host;
pgoff_t index = page->index;
unsigned start = pos & (PAGE_CACHE_SIZE - 1);
unsigned end = start + copied;
int ret = 0;
BUG_ON(PAGE_CACHE_SIZE != inode->i_sb->s_blocksize);
BUG_ON(page->index > I3_BLOCKS);
if (copied < len) {
/*
* Short write of a non-initialized page. Just tell userspace
* to retry the entire page.
*/
if (!PageUptodate(page)) {
copied = 0;
goto out;
}
}
if (copied == 0)
goto out; /* FIXME: do we need to update inode? */
if (i_size_read(inode) < (index << PAGE_CACHE_SHIFT) + end) {
i_size_write(inode, (index << PAGE_CACHE_SHIFT) + end);
mark_inode_dirty_sync(inode);
}
SetPageUptodate(page);
if (!PageDirty(page)) {
if (!get_page_reserve(inode, page))
__set_page_dirty_nobuffers(page);
else
ret = logfs_write_buf(inode, page, WF_LOCK);
}
out:
unlock_page(page);
page_cache_release(page);
return ret ? ret : copied;
}
int logfs_readpage(struct file *file, struct page *page)
{
int ret;
ret = logfs_readpage_nolock(page);
unlock_page(page);
return ret;
}
/* Clear the page's dirty flag in the radix tree. */
/* TODO: mucking with PageWriteback is silly. Add a generic function to clear
* the dirty bit from the radix tree for filesystems that don't have to wait
* for page writeback to finish (i.e. any compressing filesystem).
*/
static void clear_radix_tree_dirty(struct page *page)
{
BUG_ON(PagePrivate(page) || page->private);
set_page_writeback(page);
end_page_writeback(page);
}
static int __logfs_writepage(struct page *page)
{
struct inode *inode = page->mapping->host;
int err;
err = logfs_write_buf(inode, page, WF_LOCK);
if (err)
set_page_dirty(page);
else
clear_radix_tree_dirty(page);
unlock_page(page);
return err;
}
static int logfs_writepage(struct page *page, struct writeback_control *wbc)
{
struct inode *inode = page->mapping->host;
loff_t i_size = i_size_read(inode);
pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
unsigned offset;
u64 bix;
level_t level;
log_file("logfs_writepage(%lx, %lx, %p)\n", inode->i_ino, page->index,
page);
logfs_unpack_index(page->index, &bix, &level);
/* Indirect blocks are never truncated */
if (level != 0)
return __logfs_writepage(page);
/*
* TODO: everything below is a near-verbatim copy of nobh_writepage().
* The relevant bits should be factored out after logfs is merged.
*/
/* Is the page fully inside i_size? */
if (bix < end_index)
return __logfs_writepage(page);
/* Is the page fully outside i_size? (truncate in progress) */
offset = i_size & (PAGE_CACHE_SIZE-1);
if (bix > end_index || offset == 0) {
unlock_page(page);
return 0; /* don't care */
}
/*
* The page straddles i_size. It must be zeroed out on each and every
* writepage invocation because it may be mmapped. "A file is mapped
* in multiples of the page size. For a file that is not a multiple of
* the page size, the remaining memory is zeroed when mapped, and
* writes to that region are not written out to the file."
*/
zero_user_segment(page, offset, PAGE_CACHE_SIZE);
return __logfs_writepage(page);
}
static void logfs_invalidatepage(struct page *page, unsigned long offset)
{
move_page_to_btree(page);
BUG_ON(PagePrivate(page) || page->private);
}
static int logfs_releasepage(struct page *page, gfp_t only_xfs_uses_this)
{
return 0; /* None of these are easy to release */
}
int logfs_ioctl(struct inode *inode, struct file *file, unsigned int cmd,
unsigned long arg)
{
struct logfs_inode *li = logfs_inode(inode);
unsigned int oldflags, flags;
int err;
switch (cmd) {
case FS_IOC_GETFLAGS:
flags = li->li_flags & LOGFS_FL_USER_VISIBLE;
return put_user(flags, (int __user *)arg);
case FS_IOC_SETFLAGS:
if (IS_RDONLY(inode))
return -EROFS;
if (!is_owner_or_cap(inode))
return -EACCES;
err = get_user(flags, (int __user *)arg);
if (err)
return err;
mutex_lock(&inode->i_mutex);
oldflags = li->li_flags;
flags &= LOGFS_FL_USER_MODIFIABLE;
flags |= oldflags & ~LOGFS_FL_USER_MODIFIABLE;
li->li_flags = flags;
mutex_unlock(&inode->i_mutex);
inode->i_ctime = CURRENT_TIME;
mark_inode_dirty_sync(inode);
return 0;
default:
return -ENOTTY;
}
}
int logfs_fsync(struct file *file, struct dentry *dentry, int datasync)
{
struct super_block *sb = dentry->d_inode->i_sb;
struct logfs_super *super = logfs_super(sb);
/* FIXME: write anchor */
super->s_devops->sync(sb);
return 0;
}
static int logfs_setattr(struct dentry *dentry, struct iattr *attr)
{
struct inode *inode = dentry->d_inode;
int err = 0;
if (attr->ia_valid & ATTR_SIZE)
err = logfs_truncate(inode, attr->ia_size);
attr->ia_valid &= ~ATTR_SIZE;
if (!err)
err = inode_change_ok(inode, attr);
if (!err)
err = inode_setattr(inode, attr);
return err;
}
const struct inode_operations logfs_reg_iops = {
.setattr = logfs_setattr,
};
const struct file_operations logfs_reg_fops = {
.aio_read = generic_file_aio_read,
.aio_write = generic_file_aio_write,
.fsync = logfs_fsync,
.ioctl = logfs_ioctl,
.llseek = generic_file_llseek,
.mmap = generic_file_readonly_mmap,
.open = generic_file_open,
.read = do_sync_read,
.write = do_sync_write,
};
const struct address_space_operations logfs_reg_aops = {
.invalidatepage = logfs_invalidatepage,
.readpage = logfs_readpage,
.releasepage = logfs_releasepage,
.set_page_dirty = __set_page_dirty_nobuffers,
.writepage = logfs_writepage,
.writepages = generic_writepages,
.write_begin = logfs_write_begin,
.write_end = logfs_write_end,
};

fs/logfs/gc.c (new file)

@@ -0,0 +1,730 @@
/*
* fs/logfs/gc.c - garbage collection code
*
* As should be obvious for Linux kernel code, license is GPLv2
*
* Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
*/
#include "logfs.h"
#include <linux/sched.h>
/*
* Wear leveling needs to kick in when the difference between low erase
* counts and high erase counts gets too big. A good value for "too big"
* may be somewhat below 10% of maximum erase count for the device.
* Why not 397, to pick a nice round number with no specific meaning? :)
*
* WL_RATELIMIT is the minimum time between two wear level events. A huge
* number of segments may fulfil the requirements for wear leveling at the
* same time. If that happens we don't want to cause a latency from hell,
* but just gently pick one segment every so often and minimize overhead.
*/
#define WL_DELTA 397
#define WL_RATELIMIT 100
#define MAX_OBJ_ALIASES 2600
#define SCAN_RATIO 512 /* number of scanned segments per gc'd segment */
#define LIST_SIZE 64 /* base size of candidate lists */
#define SCAN_ROUNDS 128 /* maximum number of complete medium scans */
#define SCAN_ROUNDS_HIGH 4 /* maximum number of higher-level scans */
static int no_free_segments(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
return super->s_free_list.count;
}
/* journal has distance -1, top-most ifile layer distance 0 */
static u8 root_distance(struct super_block *sb, gc_level_t __gc_level)
{
struct logfs_super *super = logfs_super(sb);
u8 gc_level = (__force u8)__gc_level;
switch (gc_level) {
case 0: /* fall through */
case 1: /* fall through */
case 2: /* fall through */
case 3:
/* file data or indirect blocks */
return super->s_ifile_levels + super->s_iblock_levels - gc_level;
case 6: /* fall through */
case 7: /* fall through */
case 8: /* fall through */
case 9:
/* inode file data or indirect blocks */
return super->s_ifile_levels - (gc_level - 6);
default:
printk(KERN_ERR"LOGFS: segment of unknown level %x found\n",
gc_level);
WARN_ON(1);
return super->s_ifile_levels + super->s_iblock_levels;
}
}
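/*
 * Worked example (values are purely illustrative and depend on the
 * filesystem geometry): with s_iblock_levels == 4 and s_ifile_levels == 3,
 * file data (gc_level 0) maps to distance 7, a 3x indirect block
 * (gc_level 3) to distance 4, inode file data (gc_level 6) to distance 3
 * and a 3x indirect ifile block (gc_level 9) to distance 0, the top-most
 * ifile layer.
 */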
static int segment_is_reserved(struct super_block *sb, u32 segno)
{
struct logfs_super *super = logfs_super(sb);
struct logfs_area *area;
void *reserved;
int i;
/* Some segments are reserved. Just pretend they were all valid */
reserved = btree_lookup32(&super->s_reserved_segments, segno);
if (reserved)
return 1;
/* Currently open segments */
for_each_area(i) {
area = super->s_area[i];
if (area->a_is_open && area->a_segno == segno)
return 1;
}
return 0;
}
static void logfs_mark_segment_bad(struct super_block *sb, u32 segno)
{
BUG();
}
/*
* Returns the bytes consumed by valid objects in this segment. Object headers
* are counted, the segment header is not.
*/
static u32 logfs_valid_bytes(struct super_block *sb, u32 segno, u32 *ec,
gc_level_t *gc_level)
{
struct logfs_segment_entry se;
u32 ec_level;
logfs_get_segment_entry(sb, segno, &se);
if (se.ec_level == cpu_to_be32(BADSEG) ||
se.valid == cpu_to_be32(RESERVED))
return RESERVED;
ec_level = be32_to_cpu(se.ec_level);
*ec = ec_level >> 4;
*gc_level = GC_LEVEL(ec_level & 0xf);
return be32_to_cpu(se.valid);
}
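/*
 * As the shifts above show, the segment entry packs two values into the
 * 32-bit ec_level word: the upper 28 bits hold the erase count and the
 * lower 4 bits hold the GC level.  Illustrative example: ec_level 0x1234
 * decodes to an erase count of 0x123 and GC level 4.
 */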
static void logfs_cleanse_block(struct super_block *sb, u64 ofs, u64 ino,
u64 bix, gc_level_t gc_level)
{
struct inode *inode;
int err, cookie;
inode = logfs_safe_iget(sb, ino, &cookie);
err = logfs_rewrite_block(inode, bix, ofs, gc_level, 0);
BUG_ON(err);
logfs_safe_iput(inode, cookie);
}
static u32 logfs_gc_segment(struct super_block *sb, u32 segno, u8 dist)
{
struct logfs_super *super = logfs_super(sb);
struct logfs_segment_header sh;
struct logfs_object_header oh;
u64 ofs, ino, bix;
u32 seg_ofs, logical_segno, cleaned = 0;
int err, len, valid;
gc_level_t gc_level;
LOGFS_BUG_ON(segment_is_reserved(sb, segno), sb);
btree_insert32(&super->s_reserved_segments, segno, (void *)1, GFP_NOFS);
err = wbuf_read(sb, dev_ofs(sb, segno, 0), sizeof(sh), &sh);
BUG_ON(err);
gc_level = GC_LEVEL(sh.level);
logical_segno = be32_to_cpu(sh.segno);
if (sh.crc != logfs_crc32(&sh, sizeof(sh), 4)) {
logfs_mark_segment_bad(sb, segno);
cleaned = -1;
goto out;
}
for (seg_ofs = LOGFS_SEGMENT_HEADERSIZE;
seg_ofs + sizeof(oh) < super->s_segsize; ) {
ofs = dev_ofs(sb, logical_segno, seg_ofs);
err = wbuf_read(sb, dev_ofs(sb, segno, seg_ofs), sizeof(oh),
&oh);
BUG_ON(err);
if (!memchr_inv(&oh, 0xff, sizeof(oh)))
break;
if (oh.crc != logfs_crc32(&oh, sizeof(oh) - 4, 4)) {
logfs_mark_segment_bad(sb, segno);
cleaned = super->s_segsize - 1;
goto out;
}
ino = be64_to_cpu(oh.ino);
bix = be64_to_cpu(oh.bix);
len = sizeof(oh) + be16_to_cpu(oh.len);
valid = logfs_is_valid_block(sb, ofs, ino, bix, gc_level);
if (valid == 1) {
logfs_cleanse_block(sb, ofs, ino, bix, gc_level);
cleaned += len;
} else if (valid == 2) {
/* Will be invalid upon journal commit */
cleaned += len;
}
seg_ofs += len;
}
out:
btree_remove32(&super->s_reserved_segments, segno);
return cleaned;
}
static struct gc_candidate *add_list(struct gc_candidate *cand,
struct candidate_list *list)
{
struct rb_node **p = &list->rb_tree.rb_node;
struct rb_node *parent = NULL;
struct gc_candidate *cur;
int comp;
cand->list = list;
while (*p) {
parent = *p;
cur = rb_entry(parent, struct gc_candidate, rb_node);
if (list->sort_by_ec)
comp = cand->erase_count < cur->erase_count;
else
comp = cand->valid < cur->valid;
if (comp)
p = &parent->rb_left;
else
p = &parent->rb_right;
}
rb_link_node(&cand->rb_node, parent, p);
rb_insert_color(&cand->rb_node, &list->rb_tree);
if (list->count <= list->maxcount) {
list->count++;
return NULL;
}
cand = rb_entry(rb_last(&list->rb_tree), struct gc_candidate, rb_node);
rb_erase(&cand->rb_node, &list->rb_tree);
cand->list = NULL;
return cand;
}
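/*
 * In short (informal summary): add_list() maintains each candidate_list
 * as a small rb-tree ordered by erase count or by valid bytes, depending
 * on sort_by_ec.  Once the list has grown past maxcount, the worst entry
 * under that ordering - which may well be the one just inserted - is
 * unlinked and handed back so the caller can offer it to another list or
 * free it; otherwise NULL is returned.
 */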
static void remove_from_list(struct gc_candidate *cand)
{
struct candidate_list *list = cand->list;
rb_erase(&cand->rb_node, &list->rb_tree);
list->count--;
}
static void free_candidate(struct super_block *sb, struct gc_candidate *cand)
{
struct logfs_super *super = logfs_super(sb);
btree_remove32(&super->s_cand_tree, cand->segno);
kfree(cand);
}
u32 get_best_cand(struct super_block *sb, struct candidate_list *list, u32 *ec)
{
struct gc_candidate *cand;
u32 segno;
BUG_ON(list->count == 0);
cand = rb_entry(rb_first(&list->rb_tree), struct gc_candidate, rb_node);
remove_from_list(cand);
segno = cand->segno;
if (ec)
*ec = cand->erase_count;
free_candidate(sb, cand);
return segno;
}
/*
* We have several lists to manage segments with. The reserve_list is used to
* deal with bad blocks. We try to keep the best (lowest ec) segments on this
* list.
* The free_list contains free segments for normal usage. It usually gets the
* second pick after the reserve_list. But when the free_list is running short
* it is more important to keep the free_list full than to keep a reserve.
*
* Segments that are not free are put onto a per-level low_list. If we have
* to run garbage collection, we pick a candidate from there. All segments on
* those lists should have at least some free space so GC will make progress.
*
* And last we have the ec_list, which is used to pick segments for wear
* leveling.
*
* If all appropriate lists are full, we simply free the candidate and forget
* about that segment for a while. We have better candidates for each purpose.
*/
static void __add_candidate(struct super_block *sb, struct gc_candidate *cand)
{
struct logfs_super *super = logfs_super(sb);
u32 full = super->s_segsize - LOGFS_SEGMENT_RESERVE;
if (cand->valid == 0) {
/* 100% free segments */
log_gc_noisy("add reserve segment %x (ec %x) at %llx\n",
cand->segno, cand->erase_count,
dev_ofs(sb, cand->segno, 0));
cand = add_list(cand, &super->s_reserve_list);
if (cand) {
log_gc_noisy("add free segment %x (ec %x) at %llx\n",
cand->segno, cand->erase_count,
dev_ofs(sb, cand->segno, 0));
cand = add_list(cand, &super->s_free_list);
}
} else {
/* good candidates for Garbage Collection */
if (cand->valid < full)
cand = add_list(cand, &super->s_low_list[cand->dist]);
/* good candidates for wear leveling,
* segments that were recently written get ignored */
if (cand)
cand = add_list(cand, &super->s_ec_list);
}
if (cand)
free_candidate(sb, cand);
}
static int add_candidate(struct super_block *sb, u32 segno, u32 valid, u32 ec,
u8 dist)
{
struct logfs_super *super = logfs_super(sb);
struct gc_candidate *cand;
cand = kmalloc(sizeof(*cand), GFP_NOFS);
if (!cand)
return -ENOMEM;
cand->segno = segno;
cand->valid = valid;
cand->erase_count = ec;
cand->dist = dist;
btree_insert32(&super->s_cand_tree, segno, cand, GFP_NOFS);
__add_candidate(sb, cand);
return 0;
}
static void remove_segment_from_lists(struct super_block *sb, u32 segno)
{
struct logfs_super *super = logfs_super(sb);
struct gc_candidate *cand;
cand = btree_lookup32(&super->s_cand_tree, segno);
if (cand) {
remove_from_list(cand);
free_candidate(sb, cand);
}
}
static void scan_segment(struct super_block *sb, u32 segno)
{
u32 valid, ec = 0;
gc_level_t gc_level = 0;
u8 dist;
if (segment_is_reserved(sb, segno))
return;
remove_segment_from_lists(sb, segno);
valid = logfs_valid_bytes(sb, segno, &ec, &gc_level);
if (valid == RESERVED)
return;
dist = root_distance(sb, gc_level);
add_candidate(sb, segno, valid, ec, dist);
}
static struct gc_candidate *first_in_list(struct candidate_list *list)
{
if (list->count == 0)
return NULL;
return rb_entry(rb_first(&list->rb_tree), struct gc_candidate, rb_node);
}
/*
* Find the best segment for garbage collection. Main criterion is
* the segment requiring the least effort to clean. Secondary
* criterion is to GC on the lowest level available.
*
* So we search the least effort segment on the lowest level first,
* then move up and pick another segment iff it requires significantly
* less effort. Hence the LOGFS_MAX_OBJECTSIZE in the comparison.
*/
static struct gc_candidate *get_candidate(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
int i, max_dist;
struct gc_candidate *cand = NULL, *this;
max_dist = min(no_free_segments(sb), LOGFS_NO_AREAS);
for (i = max_dist; i >= 0; i--) {
this = first_in_list(&super->s_low_list[i]);
if (!this)
continue;
if (!cand)
cand = this;
if (this->valid + LOGFS_MAX_OBJECTSIZE <= cand->valid)
cand = this;
}
return cand;
}
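/*
 * Worked example of the comparison above (numbers invented): suppose the
 * best candidate at the largest distance still holds 50000 valid bytes
 * while a candidate closer to the root holds only 10000.  The closer one
 * is picked only because 10000 + LOGFS_MAX_OBJECTSIZE <= 50000, i.e. only
 * if switching levels saves more than one maximum-sized object's worth
 * of copying.
 */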
static int __logfs_gc_once(struct super_block *sb, struct gc_candidate *cand)
{
struct logfs_super *super = logfs_super(sb);
gc_level_t gc_level;
u32 cleaned, valid, segno, ec;
u8 dist;
if (!cand) {
log_gc("GC attempted, but no candidate found\n");
return 0;
}
segno = cand->segno;
dist = cand->dist;
valid = logfs_valid_bytes(sb, segno, &ec, &gc_level);
free_candidate(sb, cand);
log_gc("GC segment #%02x at %llx, %x required, %x free, %x valid, %llx free\n",
segno, (u64)segno << super->s_segshift,
dist, no_free_segments(sb), valid,
super->s_free_bytes);
cleaned = logfs_gc_segment(sb, segno, dist);
log_gc("GC segment #%02x complete - now %x valid\n", segno,
valid - cleaned);
BUG_ON(cleaned != valid);
return 1;
}
static int logfs_gc_once(struct super_block *sb)
{
struct gc_candidate *cand;
cand = get_candidate(sb);
if (cand)
remove_from_list(cand);
return __logfs_gc_once(sb, cand);
}
/* returns 1 if a wrap occurs, 0 otherwise */
static int logfs_scan_some(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
u32 segno;
int i, ret = 0;
segno = super->s_sweeper;
for (i = SCAN_RATIO; i > 0; i--) {
segno++;
if (segno >= super->s_no_segs) {
segno = 0;
ret = 1;
/* Break out of the loop. We want to read a single
* block from the segment size on next invocation if
* SCAN_RATIO is set to match block size
*/
break;
}
scan_segment(sb, segno);
}
super->s_sweeper = segno;
return ret;
}
/*
* In principle, this function should loop forever, looking for GC candidates
* and moving data. LogFS is designed in such a way that this loop is
* guaranteed to terminate.
*
* Limiting the loop to some iterations serves purely to catch cases when
* these guarantees have failed. An actual endless loop is an obvious bug
* and should be reported as such.
*/
static void __logfs_gc_pass(struct super_block *sb, int target)
{
struct logfs_super *super = logfs_super(sb);
struct logfs_block *block;
int round, progress, last_progress = 0;
if (no_free_segments(sb) >= target &&
super->s_no_object_aliases < MAX_OBJ_ALIASES)
return;
log_gc("__logfs_gc_pass(%x)\n", target);
for (round = 0; round < SCAN_ROUNDS; ) {
if (no_free_segments(sb) >= target)
goto write_alias;
/* Sync in-memory state with on-medium state in case they
* diverged */
logfs_write_anchor(sb);
round += logfs_scan_some(sb);
if (no_free_segments(sb) >= target)
goto write_alias;
progress = logfs_gc_once(sb);
if (progress)
last_progress = round;
else if (round - last_progress > 2)
break;
continue;
/*
* The goto logic is nasty, I just don't know a better way to
* code it. GC is supposed to ensure two things:
* 1. Enough free segments are available.
* 2. The number of aliases is bounded.
* When 1. is achieved, we take a look at 2. and write back
* some alias-containing blocks, if necessary. However, after
* each such write we need to go back to 1., as writes can
* consume free segments.
*/
write_alias:
if (super->s_no_object_aliases < MAX_OBJ_ALIASES)
return;
if (list_empty(&super->s_object_alias)) {
/* All aliases are still in btree */
return;
}
log_gc("Write back one alias\n");
block = list_entry(super->s_object_alias.next,
struct logfs_block, alias_list);
block->ops->write_block(block);
/*
* To round off the nasty goto logic, we reset round here. It
* is a safety-net for GC not making any progress and limited
* to something reasonably small. If we incremented it for every
* single alias, the loop could terminate rather quickly.
*/
round = 0;
}
LOGFS_BUG(sb);
}
static int wl_ratelimit(struct super_block *sb, u64 *next_event)
{
struct logfs_super *super = logfs_super(sb);
if (*next_event < super->s_gec) {
*next_event = super->s_gec + WL_RATELIMIT;
return 0;
}
return 1;
}
static void logfs_wl_pass(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
struct gc_candidate *wl_cand, *free_cand;
if (wl_ratelimit(sb, &super->s_wl_gec_ostore))
return;
wl_cand = first_in_list(&super->s_ec_list);
if (!wl_cand)
return;
free_cand = first_in_list(&super->s_free_list);
if (!free_cand)
return;
if (wl_cand->erase_count < free_cand->erase_count + WL_DELTA) {
remove_from_list(wl_cand);
__logfs_gc_once(sb, wl_cand);
}
}
/*
* The journal needs wear leveling as well. But moving the journal is an
* expensive operation so we try to avoid it as much as possible. And if we
* have to do it, we move the whole journal, not individual segments.
*
* Ratelimiting is not strictly necessary here, it mainly serves to avoid the
* calculations. First we check whether moving the journal would be a
* significant improvement. That means that a) the current journal segments
* have more wear than the future journal segments and b) the current journal
* segments have more wear than normal ostore segments.
* Rationale for b) is that we don't have to move the journal if it is aging
* less than the ostore, even if the reserve segments age even less (they are
* excluded from wear leveling, after all).
* Next we check that the superblocks have less wear than the journal. Since
* moving the journal requires writing the superblocks, we have to protect the
* superblocks even more than the journal.
*
* Also we double the acceptable wear difference, compared to ostore wear
* leveling. Journal data is read and rewritten rapidly, comparatively. So
* soft errors have much less time to accumulate and we allow the journal to
* be a bit worse than the ostore.
*/
static void logfs_journal_wl_pass(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
struct gc_candidate *cand;
u32 min_journal_ec = -1, max_reserve_ec = 0;
int i;
if (wl_ratelimit(sb, &super->s_wl_gec_journal))
return;
if (super->s_reserve_list.count < super->s_no_journal_segs) {
/* Reserve is not full enough to move complete journal */
return;
}
journal_for_each(i)
if (super->s_journal_seg[i])
min_journal_ec = min(min_journal_ec,
super->s_journal_ec[i]);
cand = rb_entry(rb_first(&super->s_free_list.rb_tree),
struct gc_candidate, rb_node);
max_reserve_ec = cand->erase_count;
for (i = 0; i < 2; i++) {
struct logfs_segment_entry se;
u32 segno = seg_no(sb, super->s_sb_ofs[i]);
u32 ec;
logfs_get_segment_entry(sb, segno, &se);
ec = be32_to_cpu(se.ec_level) >> 4;
max_reserve_ec = max(max_reserve_ec, ec);
}
if (min_journal_ec > max_reserve_ec + 2 * WL_DELTA) {
do_logfs_journal_wl_pass(sb);
}
}
void logfs_gc_pass(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
//BUG_ON(mutex_trylock(&logfs_super(sb)->s_w_mutex));
/* Write journal before free space is getting saturated with dirty
* objects.
*/
if (super->s_dirty_used_bytes + super->s_dirty_free_bytes
+ LOGFS_MAX_OBJECTSIZE >= super->s_free_bytes)
logfs_write_anchor(sb);
__logfs_gc_pass(sb, super->s_total_levels);
logfs_wl_pass(sb);
logfs_journal_wl_pass(sb);
}
static int check_area(struct super_block *sb, int i)
{
struct logfs_super *super = logfs_super(sb);
struct logfs_area *area = super->s_area[i];
struct logfs_object_header oh;
u32 segno = area->a_segno;
u32 ofs = area->a_used_bytes;
__be32 crc;
int err;
if (!area->a_is_open)
return 0;
for (ofs = area->a_used_bytes;
ofs <= super->s_segsize - sizeof(oh);
ofs += (u32)be16_to_cpu(oh.len) + sizeof(oh)) {
err = wbuf_read(sb, dev_ofs(sb, segno, ofs), sizeof(oh), &oh);
if (err)
return err;
if (!memchr_inv(&oh, 0xff, sizeof(oh)))
break;
crc = logfs_crc32(&oh, sizeof(oh) - 4, 4);
if (crc != oh.crc) {
printk(KERN_INFO "interrupted header at %llx\n",
dev_ofs(sb, segno, ofs));
return 0;
}
}
if (ofs != area->a_used_bytes) {
printk(KERN_INFO "%x bytes unaccounted data found at %llx\n",
ofs - area->a_used_bytes,
dev_ofs(sb, segno, area->a_used_bytes));
area->a_used_bytes = ofs;
}
return 0;
}
int logfs_check_areas(struct super_block *sb)
{
int i, err;
for_each_area(i) {
err = check_area(sb, i);
if (err)
return err;
}
return 0;
}
static void logfs_init_candlist(struct candidate_list *list, int maxcount,
int sort_by_ec)
{
list->count = 0;
list->maxcount = maxcount;
list->sort_by_ec = sort_by_ec;
list->rb_tree = RB_ROOT;
}
int logfs_init_gc(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
int i;
btree_init_mempool32(&super->s_cand_tree, super->s_btree_pool);
logfs_init_candlist(&super->s_free_list, LIST_SIZE + SCAN_RATIO, 1);
logfs_init_candlist(&super->s_reserve_list,
super->s_bad_seg_reserve, 1);
for_each_area(i)
logfs_init_candlist(&super->s_low_list[i], LIST_SIZE, 0);
logfs_init_candlist(&super->s_ec_list, LIST_SIZE, 1);
return 0;
}
static void logfs_cleanup_list(struct super_block *sb,
struct candidate_list *list)
{
struct gc_candidate *cand;
while (list->count) {
cand = rb_entry(list->rb_tree.rb_node, struct gc_candidate,
rb_node);
remove_from_list(cand);
free_candidate(sb, cand);
}
BUG_ON(list->rb_tree.rb_node);
}
void logfs_cleanup_gc(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
int i;
if (!super->s_free_list.count)
return;
/*
* FIXME: The btree may still contain a single empty node. So we
* call the grim visitor to clean up that mess. Btree code should
* do it for us, really.
*/
btree_grim_visitor32(&super->s_cand_tree, 0, NULL);
logfs_cleanup_list(sb, &super->s_free_list);
logfs_cleanup_list(sb, &super->s_reserve_list);
for_each_area(i)
logfs_cleanup_list(sb, &super->s_low_list[i]);
logfs_cleanup_list(sb, &super->s_ec_list);
}

fs/logfs/inode.c Normal file

@ -0,0 +1,417 @@
/*
* fs/logfs/inode.c - inode handling code
*
* As should be obvious for Linux kernel code, license is GPLv2
*
* Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
*/
#include "logfs.h"
#include <linux/writeback.h>
#include <linux/backing-dev.h>
/*
* How soon to reuse old inode numbers? LogFS doesn't store deleted inodes
* on the medium. It therefore also lacks a method to store the previous
* generation number for deleted inodes. Instead a single generation number
* is stored which will be used for new inodes. Being just a 32bit counter,
* this can obviously wrap relatively quickly. So we only reuse inodes if we
* know that a fair number of inodes can be created before we have to increment
* the generation again - effectively adding some bits to the counter.
* But being too aggressive here means we keep a very large and very sparse
* inode file, wasting space on indirect blocks.
* So what is a good value? Beats me. 64k seems moderately bad on both
* fronts, so let's use that for now...
*
* NFS sucks, as everyone already knows.
*/
#define INOS_PER_WRAP (0x10000)
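/*
 * Illustrative consequence of the scheme above: with a 32-bit generation
 * counter and 64k inode numbers handed out per generation, roughly 2^48
 * inodes can be created before a (generation, ino) pair can repeat -
 * hopefully enough to keep stale NFS file handles unambiguous in
 * practice.
 */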
/*
* Logfs' requirement to read inodes for garbage collection makes life a bit
* harder. GC may have to read inodes that are in I_FREEING state, when they
* are being written out - and waiting for GC to make progress, naturally.
*
* So we cannot just call iget() or some variant of it, but first have to check
* whether the inode in question might be in I_FREEING state. Therefore we
* maintain our own per-sb list of "almost deleted" inodes and check against
* that list first. Normally this should be at most 1-2 entries long.
*
* Also, inodes have logfs-specific reference counting on top of what the vfs
* does. When .destroy_inode is called, normally the reference count will drop
* to zero and the inode gets deleted. But if GC accessed the inode, its
* refcount will remain nonzero and final deletion will have to wait.
*
* As a result we have two sets of functions to get/put inodes:
* logfs_safe_iget/logfs_safe_iput - safe to call from GC context
* logfs_iget/iput - normal version
*/
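/*
 * Typical GC-side usage, as seen in logfs_cleanse_block() in fs/logfs/gc.c:
 *
 * int cookie;
 * struct inode *inode = logfs_safe_iget(sb, ino, &cookie);
 * ... use the inode ...
 * logfs_safe_iput(inode, cookie);
 *
 * The cookie remembers whether a cached (possibly I_FREEING) inode was
 * handed out, so logfs_safe_iput() knows whether to drop the private
 * refcount or to call iput().
 */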
static struct kmem_cache *logfs_inode_cache;
static DEFINE_SPINLOCK(logfs_inode_lock);
static void logfs_inode_setops(struct inode *inode)
{
switch (inode->i_mode & S_IFMT) {
case S_IFDIR:
inode->i_op = &logfs_dir_iops;
inode->i_fop = &logfs_dir_fops;
inode->i_mapping->a_ops = &logfs_reg_aops;
break;
case S_IFREG:
inode->i_op = &logfs_reg_iops;
inode->i_fop = &logfs_reg_fops;
inode->i_mapping->a_ops = &logfs_reg_aops;
break;
case S_IFLNK:
inode->i_op = &logfs_symlink_iops;
inode->i_mapping->a_ops = &logfs_reg_aops;
break;
case S_IFSOCK: /* fall through */
case S_IFBLK: /* fall through */
case S_IFCHR: /* fall through */
case S_IFIFO:
init_special_inode(inode, inode->i_mode, inode->i_rdev);
break;
default:
BUG();
}
}
static struct inode *__logfs_iget(struct super_block *sb, ino_t ino)
{
struct inode *inode = iget_locked(sb, ino);
int err;
if (!inode)
return ERR_PTR(-ENOMEM);
if (!(inode->i_state & I_NEW))
return inode;
err = logfs_read_inode(inode);
if (err || inode->i_nlink == 0) {
/* inode->i_nlink == 0 can be true when called from
* block validator */
/* set i_nlink to 0 to prevent caching */
inode->i_nlink = 0;
logfs_inode(inode)->li_flags |= LOGFS_IF_ZOMBIE;
iget_failed(inode);
if (!err)
err = -ENOENT;
return ERR_PTR(err);
}
logfs_inode_setops(inode);
unlock_new_inode(inode);
return inode;
}
struct inode *logfs_iget(struct super_block *sb, ino_t ino)
{
BUG_ON(ino == LOGFS_INO_MASTER);
BUG_ON(ino == LOGFS_INO_SEGFILE);
return __logfs_iget(sb, ino);
}
/*
* is_cached is set to 1 if we hand out a cached inode, 0 otherwise.
* This allows logfs_safe_iput() to do the right thing later.
*/
struct inode *logfs_safe_iget(struct super_block *sb, ino_t ino, int *is_cached)
{
struct logfs_super *super = logfs_super(sb);
struct logfs_inode *li;
if (ino == LOGFS_INO_MASTER)
return super->s_master_inode;
if (ino == LOGFS_INO_SEGFILE)
return super->s_segfile_inode;
spin_lock(&logfs_inode_lock);
list_for_each_entry(li, &super->s_freeing_list, li_freeing_list)
if (li->vfs_inode.i_ino == ino) {
li->li_refcount++;
spin_unlock(&logfs_inode_lock);
*is_cached = 1;
return &li->vfs_inode;
}
spin_unlock(&logfs_inode_lock);
*is_cached = 0;
return __logfs_iget(sb, ino);
}
static void __logfs_destroy_inode(struct inode *inode)
{
struct logfs_inode *li = logfs_inode(inode);
BUG_ON(li->li_block);
list_del(&li->li_freeing_list);
kmem_cache_free(logfs_inode_cache, li);
}
static void logfs_destroy_inode(struct inode *inode)
{
struct logfs_inode *li = logfs_inode(inode);
BUG_ON(list_empty(&li->li_freeing_list));
spin_lock(&logfs_inode_lock);
li->li_refcount--;
if (li->li_refcount == 0)
__logfs_destroy_inode(inode);
spin_unlock(&logfs_inode_lock);
}
void logfs_safe_iput(struct inode *inode, int is_cached)
{
if (inode->i_ino == LOGFS_INO_MASTER)
return;
if (inode->i_ino == LOGFS_INO_SEGFILE)
return;
if (is_cached) {
logfs_destroy_inode(inode);
return;
}
iput(inode);
}
static void logfs_init_inode(struct super_block *sb, struct inode *inode)
{
struct logfs_inode *li = logfs_inode(inode);
int i;
li->li_flags = 0;
li->li_height = 0;
li->li_used_bytes = 0;
li->li_block = NULL;
inode->i_uid = 0;
inode->i_gid = 0;
inode->i_size = 0;
inode->i_blocks = 0;
inode->i_ctime = CURRENT_TIME;
inode->i_mtime = CURRENT_TIME;
inode->i_nlink = 1;
INIT_LIST_HEAD(&li->li_freeing_list);
for (i = 0; i < LOGFS_EMBEDDED_FIELDS; i++)
li->li_data[i] = 0;
return;
}
static struct inode *logfs_alloc_inode(struct super_block *sb)
{
struct logfs_inode *li;
li = kmem_cache_alloc(logfs_inode_cache, GFP_NOFS);
if (!li)
return NULL;
logfs_init_inode(sb, &li->vfs_inode);
return &li->vfs_inode;
}
/*
* In logfs inodes are written to an inode file. The inode file, like any
* other file, is managed with an inode. The inode file's inode, aka master
* inode, requires special handling in several respects. First, it cannot be
* written to the inode file, so it is stored in the journal instead.
*
* Secondly, this inode cannot be written back and destroyed before all other
* inodes have been written. The ordering is important. Linux' VFS is happily
* unaware of the ordering constraint and would ordinarily destroy the master
* inode at umount time while other inodes are still in use and dirty. Not
* good.
*
* So logfs makes sure the master inode is not written until all other inodes
* have been destroyed. Sadly, this method has another side-effect. The VFS
* will notice one remaining inode and print a frightening warning message.
* Worse, it is impossible to judge whether such a warning was caused by the
* master inode alone or whether other inodes have leaked as well.
*
* Our attempt at solving this is logfs_new_meta_inode() below. Its
* purpose is to create a new inode that will not trigger the warning if such
* an inode is still in use. An ugly hack, no doubt. Suggestions for
* improvement are welcome.
*/
struct inode *logfs_new_meta_inode(struct super_block *sb, u64 ino)
{
struct inode *inode;
inode = logfs_alloc_inode(sb);
if (!inode)
return ERR_PTR(-ENOMEM);
inode->i_mode = S_IFREG;
inode->i_ino = ino;
inode->i_sb = sb;
/* This is a blatant copy of alloc_inode code. We'd need alloc_inode
* to be nonstatic, alas. */
{
struct address_space * const mapping = &inode->i_data;
mapping->a_ops = &logfs_reg_aops;
mapping->host = inode;
mapping->flags = 0;
mapping_set_gfp_mask(mapping, GFP_NOFS);
mapping->assoc_mapping = NULL;
mapping->backing_dev_info = &default_backing_dev_info;
inode->i_mapping = mapping;
inode->i_nlink = 1;
}
return inode;
}
struct inode *logfs_read_meta_inode(struct super_block *sb, u64 ino)
{
struct inode *inode;
int err;
inode = logfs_new_meta_inode(sb, ino);
if (IS_ERR(inode))
return inode;
err = logfs_read_inode(inode);
if (err) {
destroy_meta_inode(inode);
return ERR_PTR(err);
}
logfs_inode_setops(inode);
return inode;
}
static int logfs_write_inode(struct inode *inode, struct writeback_control *wbc)
{
int ret;
long flags = WF_LOCK;
/* Can only happen if creat() failed. Safe to skip. */
if (logfs_inode(inode)->li_flags & LOGFS_IF_STILLBORN)
return 0;
ret = __logfs_write_inode(inode, flags);
LOGFS_BUG_ON(ret, inode->i_sb);
return ret;
}
void destroy_meta_inode(struct inode *inode)
{
if (inode) {
if (inode->i_data.nrpages)
truncate_inode_pages(&inode->i_data, 0);
logfs_clear_inode(inode);
kmem_cache_free(logfs_inode_cache, logfs_inode(inode));
}
}
/* called with inode_lock held */
static void logfs_drop_inode(struct inode *inode)
{
struct logfs_super *super = logfs_super(inode->i_sb);
struct logfs_inode *li = logfs_inode(inode);
spin_lock(&logfs_inode_lock);
list_move(&li->li_freeing_list, &super->s_freeing_list);
spin_unlock(&logfs_inode_lock);
generic_drop_inode(inode);
}
static void logfs_set_ino_generation(struct super_block *sb,
struct inode *inode)
{
struct logfs_super *super = logfs_super(sb);
u64 ino;
mutex_lock(&super->s_journal_mutex);
ino = logfs_seek_hole(super->s_master_inode, super->s_last_ino);
super->s_last_ino = ino;
super->s_inos_till_wrap--;
if (super->s_inos_till_wrap < 0) {
super->s_last_ino = LOGFS_RESERVED_INOS;
super->s_generation++;
super->s_inos_till_wrap = INOS_PER_WRAP;
}
inode->i_ino = ino;
inode->i_generation = super->s_generation;
mutex_unlock(&super->s_journal_mutex);
}
struct inode *logfs_new_inode(struct inode *dir, int mode)
{
struct super_block *sb = dir->i_sb;
struct inode *inode;
inode = new_inode(sb);
if (!inode)
return ERR_PTR(-ENOMEM);
logfs_init_inode(sb, inode);
/* inherit parent flags */
logfs_inode(inode)->li_flags |=
logfs_inode(dir)->li_flags & LOGFS_FL_INHERITED;
inode->i_mode = mode;
logfs_set_ino_generation(sb, inode);
inode->i_uid = current_fsuid();
inode->i_gid = current_fsgid();
if (dir->i_mode & S_ISGID) {
inode->i_gid = dir->i_gid;
if (S_ISDIR(mode))
inode->i_mode |= S_ISGID;
}
logfs_inode_setops(inode);
insert_inode_hash(inode);
return inode;
}
static void logfs_init_once(void *_li)
{
struct logfs_inode *li = _li;
int i;
li->li_flags = 0;
li->li_used_bytes = 0;
li->li_refcount = 1;
for (i = 0; i < LOGFS_EMBEDDED_FIELDS; i++)
li->li_data[i] = 0;
inode_init_once(&li->vfs_inode);
}
static int logfs_sync_fs(struct super_block *sb, int wait)
{
/* FIXME: write anchor */
logfs_super(sb)->s_devops->sync(sb);
return 0;
}
const struct super_operations logfs_super_operations = {
.alloc_inode = logfs_alloc_inode,
.clear_inode = logfs_clear_inode,
.delete_inode = logfs_delete_inode,
.destroy_inode = logfs_destroy_inode,
.drop_inode = logfs_drop_inode,
.write_inode = logfs_write_inode,
.statfs = logfs_statfs,
.sync_fs = logfs_sync_fs,
};
int logfs_init_inode_cache(void)
{
logfs_inode_cache = kmem_cache_create("logfs_inode_cache",
sizeof(struct logfs_inode), 0, SLAB_RECLAIM_ACCOUNT,
logfs_init_once);
if (!logfs_inode_cache)
return -ENOMEM;
return 0;
}
void logfs_destroy_inode_cache(void)
{
kmem_cache_destroy(logfs_inode_cache);
}

fs/logfs/journal.c Normal file

@ -0,0 +1,883 @@
/*
* fs/logfs/journal.c - journal handling code
*
* As should be obvious for Linux kernel code, license is GPLv2
*
* Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
*/
#include "logfs.h"
static void logfs_calc_free(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
u64 reserve, no_segs = super->s_no_segs;
s64 free;
int i;
/* superblock segments */
no_segs -= 2;
super->s_no_journal_segs = 0;
/* journal */
journal_for_each(i)
if (super->s_journal_seg[i]) {
no_segs--;
super->s_no_journal_segs++;
}
/* open segments plus one extra per level for GC */
no_segs -= 2 * super->s_total_levels;
free = no_segs * (super->s_segsize - LOGFS_SEGMENT_RESERVE);
free -= super->s_used_bytes;
/* just a bit extra */
free -= super->s_total_levels * 4096;
/* Bad blocks are 'paid' for with speed reserve - the filesystem
* simply gets slower as bad blocks accumulate. Until the bad blocks
* exceed the speed reserve - then the filesystem gets smaller.
*/
reserve = super->s_bad_segments + super->s_bad_seg_reserve;
reserve *= super->s_segsize - LOGFS_SEGMENT_RESERVE;
reserve = max(reserve, super->s_speed_reserve);
free -= reserve;
if (free < 0)
free = 0;
super->s_free_bytes = free;
}
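/*
 * Back-of-the-envelope example of the calculation above (all numbers
 * invented): on a medium with 1024 segments, two superblock segments,
 * four journal segments and 2 * s_total_levels == 14 open/GC-reserve
 * segments leave 1004 payload segments.  From their capacity we subtract
 * the per-segment reserve, the bytes already used, a little per-level
 * slack and the larger of the bad-block and speed reserves to arrive at
 * s_free_bytes.
 */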
static void reserve_sb_and_journal(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
struct btree_head32 *head = &super->s_reserved_segments;
int i, err;
err = btree_insert32(head, seg_no(sb, super->s_sb_ofs[0]), (void *)1,
GFP_KERNEL);
BUG_ON(err);
err = btree_insert32(head, seg_no(sb, super->s_sb_ofs[1]), (void *)1,
GFP_KERNEL);
BUG_ON(err);
journal_for_each(i) {
if (!super->s_journal_seg[i])
continue;
err = btree_insert32(head, super->s_journal_seg[i], (void *)1,
GFP_KERNEL);
BUG_ON(err);
}
}
static void read_dynsb(struct super_block *sb,
struct logfs_je_dynsb *dynsb)
{
struct logfs_super *super = logfs_super(sb);
super->s_gec = be64_to_cpu(dynsb->ds_gec);
super->s_sweeper = be64_to_cpu(dynsb->ds_sweeper);
super->s_victim_ino = be64_to_cpu(dynsb->ds_victim_ino);
super->s_rename_dir = be64_to_cpu(dynsb->ds_rename_dir);
super->s_rename_pos = be64_to_cpu(dynsb->ds_rename_pos);
super->s_used_bytes = be64_to_cpu(dynsb->ds_used_bytes);
super->s_generation = be32_to_cpu(dynsb->ds_generation);
}
static void read_anchor(struct super_block *sb,
struct logfs_je_anchor *da)
{
struct logfs_super *super = logfs_super(sb);
struct inode *inode = super->s_master_inode;
struct logfs_inode *li = logfs_inode(inode);
int i;
super->s_last_ino = be64_to_cpu(da->da_last_ino);
li->li_flags = 0;
li->li_height = da->da_height;
i_size_write(inode, be64_to_cpu(da->da_size));
li->li_used_bytes = be64_to_cpu(da->da_used_bytes);
for (i = 0; i < LOGFS_EMBEDDED_FIELDS; i++)
li->li_data[i] = be64_to_cpu(da->da_data[i]);
}
static void read_erasecount(struct super_block *sb,
struct logfs_je_journal_ec *ec)
{
struct logfs_super *super = logfs_super(sb);
int i;
journal_for_each(i)
super->s_journal_ec[i] = be32_to_cpu(ec->ec[i]);
}
static int read_area(struct super_block *sb, struct logfs_je_area *a)
{
struct logfs_super *super = logfs_super(sb);
struct logfs_area *area = super->s_area[a->gc_level];
u64 ofs;
u32 writemask = ~(super->s_writesize - 1);
if (a->gc_level >= LOGFS_NO_AREAS)
return -EIO;
if (a->vim != VIM_DEFAULT)
return -EIO; /* TODO: close area and continue */
area->a_used_bytes = be32_to_cpu(a->used_bytes);
area->a_written_bytes = area->a_used_bytes & writemask;
area->a_segno = be32_to_cpu(a->segno);
if (area->a_segno)
area->a_is_open = 1;
ofs = dev_ofs(sb, area->a_segno, area->a_written_bytes);
if (super->s_writesize > 1)
logfs_buf_recover(area, ofs, a + 1, super->s_writesize);
else
logfs_buf_recover(area, ofs, NULL, 0);
return 0;
}
static void *unpack(void *from, void *to)
{
struct logfs_journal_header *jh = from;
void *data = from + sizeof(struct logfs_journal_header);
int err;
size_t inlen, outlen;
inlen = be16_to_cpu(jh->h_len);
outlen = be16_to_cpu(jh->h_datalen);
if (jh->h_compr == COMPR_NONE)
memcpy(to, data, inlen);
else {
err = logfs_uncompress(data, to, inlen, outlen);
BUG_ON(err);
}
return to;
}
static int __read_je_header(struct super_block *sb, u64 ofs,
struct logfs_journal_header *jh)
{
struct logfs_super *super = logfs_super(sb);
size_t bufsize = max_t(size_t, sb->s_blocksize, super->s_writesize)
+ MAX_JOURNAL_HEADER;
u16 type, len, datalen;
int err;
/* read header only */
err = wbuf_read(sb, ofs, sizeof(*jh), jh);
if (err)
return err;
type = be16_to_cpu(jh->h_type);
len = be16_to_cpu(jh->h_len);
datalen = be16_to_cpu(jh->h_datalen);
if (len > sb->s_blocksize)
return -EIO;
if ((type < JE_FIRST) || (type > JE_LAST))
return -EIO;
if (datalen > bufsize)
return -EIO;
return 0;
}
static int __read_je_payload(struct super_block *sb, u64 ofs,
struct logfs_journal_header *jh)
{
u16 len;
int err;
len = be16_to_cpu(jh->h_len);
err = wbuf_read(sb, ofs + sizeof(*jh), len, jh + 1);
if (err)
return err;
if (jh->h_crc != logfs_crc32(jh, len + sizeof(*jh), 4)) {
/* Old code was confused. It forgot about the header length
* and stopped calculating the crc 16 bytes before the end
* of data - ick!
* FIXME: Remove this hack once the old code is fixed.
*/
if (jh->h_crc == logfs_crc32(jh, len, 4))
WARN_ON_ONCE(1);
else
return -EIO;
}
return 0;
}
/*
* jh needs to be large enough to hold the complete entry, not just the header
*/
static int __read_je(struct super_block *sb, u64 ofs,
struct logfs_journal_header *jh)
{
int err;
err = __read_je_header(sb, ofs, jh);
if (err)
return err;
return __read_je_payload(sb, ofs, jh);
}
static int read_je(struct super_block *sb, u64 ofs)
{
struct logfs_super *super = logfs_super(sb);
struct logfs_journal_header *jh = super->s_compressed_je;
void *scratch = super->s_je;
u16 type, datalen;
int err;
err = __read_je(sb, ofs, jh);
if (err)
return err;
type = be16_to_cpu(jh->h_type);
datalen = be16_to_cpu(jh->h_datalen);
switch (type) {
case JE_DYNSB:
read_dynsb(sb, unpack(jh, scratch));
break;
case JE_ANCHOR:
read_anchor(sb, unpack(jh, scratch));
break;
case JE_ERASECOUNT:
read_erasecount(sb, unpack(jh, scratch));
break;
case JE_AREA:
read_area(sb, unpack(jh, scratch));
break;
case JE_OBJ_ALIAS:
err = logfs_load_object_aliases(sb, unpack(jh, scratch),
datalen);
break;
default:
WARN_ON_ONCE(1);
return -EIO;
}
return err;
}
static int logfs_read_segment(struct super_block *sb, u32 segno)
{
struct logfs_super *super = logfs_super(sb);
struct logfs_journal_header *jh = super->s_compressed_je;
u64 ofs, seg_ofs = dev_ofs(sb, segno, 0);
u32 h_ofs, last_ofs = 0;
u16 len, datalen, last_len = 0;
int i, err;
/* search for most recent commit */
for (h_ofs = 0; h_ofs < super->s_segsize; h_ofs += sizeof(*jh)) {
ofs = seg_ofs + h_ofs;
err = __read_je_header(sb, ofs, jh);
if (err)
continue;
if (jh->h_type != cpu_to_be16(JE_COMMIT))
continue;
err = __read_je_payload(sb, ofs, jh);
if (err)
continue;
len = be16_to_cpu(jh->h_len);
datalen = be16_to_cpu(jh->h_datalen);
if ((datalen > sizeof(super->s_je_array)) ||
(datalen % sizeof(__be64)))
continue;
last_ofs = h_ofs;
last_len = datalen;
h_ofs += ALIGN(len, sizeof(*jh)) - sizeof(*jh);
}
/* read commit */
if (last_ofs == 0)
return -ENOENT;
ofs = seg_ofs + last_ofs;
log_journal("Read commit from %llx\n", ofs);
err = __read_je(sb, ofs, jh);
BUG_ON(err); /* We should have caught it in the scan loop already */
if (err)
return err;
/* uncompress */
unpack(jh, super->s_je_array);
super->s_no_je = last_len / sizeof(__be64);
/* iterate over array */
for (i = 0; i < super->s_no_je; i++) {
err = read_je(sb, be64_to_cpu(super->s_je_array[i]));
if (err)
return err;
}
super->s_journal_area->a_segno = segno;
return 0;
}
static u64 read_gec(struct super_block *sb, u32 segno)
{
struct logfs_segment_header sh;
__be32 crc;
int err;
if (!segno)
return 0;
err = wbuf_read(sb, dev_ofs(sb, segno, 0), sizeof(sh), &sh);
if (err)
return 0;
crc = logfs_crc32(&sh, sizeof(sh), 4);
if (crc != sh.crc) {
WARN_ON(sh.gec != cpu_to_be64(0xffffffffffffffffull));
/* Most likely it was just erased */
return 0;
}
return be64_to_cpu(sh.gec);
}
static int logfs_read_journal(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
u64 gec[LOGFS_JOURNAL_SEGS], max;
u32 segno;
int i, max_i;
max = 0;
max_i = -1;
journal_for_each(i) {
segno = super->s_journal_seg[i];
gec[i] = read_gec(sb, super->s_journal_seg[i]);
if (gec[i] > max) {
max = gec[i];
max_i = i;
}
}
if (max_i == -1)
return -EIO;
/* FIXME: Try older segments in case of error */
return logfs_read_segment(sb, super->s_journal_seg[max_i]);
}
/*
* First search the current segment (outer loop), then pick the next segment
* in the array, skipping any zero entries (inner loop).
*/
static void journal_get_free_segment(struct logfs_area *area)
{
struct logfs_super *super = logfs_super(area->a_sb);
int i;
journal_for_each(i) {
if (area->a_segno != super->s_journal_seg[i])
continue;
do {
i++;
if (i == LOGFS_JOURNAL_SEGS)
i = 0;
} while (!super->s_journal_seg[i]);
area->a_segno = super->s_journal_seg[i];
area->a_erase_count = ++(super->s_journal_ec[i]);
log_journal("Journal now at %x (ec %x)\n", area->a_segno,
area->a_erase_count);
return;
}
BUG();
}
static void journal_get_erase_count(struct logfs_area *area)
{
/* erase count is stored globally and incremented in
* journal_get_free_segment() - nothing to do here */
}
static int journal_erase_segment(struct logfs_area *area)
{
struct super_block *sb = area->a_sb;
struct logfs_segment_header sh;
u64 ofs;
int err;
err = logfs_erase_segment(sb, area->a_segno, 1);
if (err)
return err;
sh.pad = 0;
sh.type = SEG_JOURNAL;
sh.level = 0;
sh.segno = cpu_to_be32(area->a_segno);
sh.ec = cpu_to_be32(area->a_erase_count);
sh.gec = cpu_to_be64(logfs_super(sb)->s_gec);
sh.crc = logfs_crc32(&sh, sizeof(sh), 4);
/* This causes a bug in segment.c. Not yet. */
//logfs_set_segment_erased(sb, area->a_segno, area->a_erase_count, 0);
ofs = dev_ofs(sb, area->a_segno, 0);
area->a_used_bytes = ALIGN(sizeof(sh), 16);
logfs_buf_write(area, ofs, &sh, sizeof(sh));
return 0;
}
static size_t __logfs_write_header(struct logfs_super *super,
struct logfs_journal_header *jh, size_t len, size_t datalen,
u16 type, u8 compr)
{
jh->h_len = cpu_to_be16(len);
jh->h_type = cpu_to_be16(type);
jh->h_datalen = cpu_to_be16(datalen);
jh->h_compr = compr;
jh->h_pad[0] = 'H';
jh->h_pad[1] = 'E';
jh->h_pad[2] = 'A';
jh->h_pad[3] = 'D';
jh->h_pad[4] = 'R';
jh->h_crc = logfs_crc32(jh, len + sizeof(*jh), 4);
return ALIGN(len, 16) + sizeof(*jh);
}
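/*
 * On the medium a journal entry therefore looks roughly like this
 * (sketch, not to scale):
 *
 * +-----------------------+------------------------+------------------+
 * | logfs_journal_header  | payload, h_len bytes   | pad to 16 bytes  |
 * +-----------------------+------------------------+------------------+
 *
 * The CRC stored in the header covers the remainder of the header plus
 * the payload; the padding bytes are not covered.
 */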
static size_t logfs_write_header(struct logfs_super *super,
struct logfs_journal_header *jh, size_t datalen, u16 type)
{
size_t len = datalen;
return __logfs_write_header(super, jh, len, datalen, type, COMPR_NONE);
}
static inline size_t logfs_journal_erasecount_size(struct logfs_super *super)
{
return LOGFS_JOURNAL_SEGS * sizeof(__be32);
}
static void *logfs_write_erasecount(struct super_block *sb, void *_ec,
u16 *type, size_t *len)
{
struct logfs_super *super = logfs_super(sb);
struct logfs_je_journal_ec *ec = _ec;
int i;
journal_for_each(i)
ec->ec[i] = cpu_to_be32(super->s_journal_ec[i]);
*type = JE_ERASECOUNT;
*len = logfs_journal_erasecount_size(super);
return ec;
}
static void account_shadow(void *_shadow, unsigned long _sb, u64 ignore,
size_t ignore2)
{
struct logfs_shadow *shadow = _shadow;
struct super_block *sb = (void *)_sb;
struct logfs_super *super = logfs_super(sb);
/* consume new space */
super->s_free_bytes -= shadow->new_len;
super->s_used_bytes += shadow->new_len;
super->s_dirty_used_bytes -= shadow->new_len;
/* free up old space */
super->s_free_bytes += shadow->old_len;
super->s_used_bytes -= shadow->old_len;
super->s_dirty_free_bytes -= shadow->old_len;
logfs_set_segment_used(sb, shadow->old_ofs, -shadow->old_len);
logfs_set_segment_used(sb, shadow->new_ofs, shadow->new_len);
log_journal("account_shadow(%llx, %llx, %x) %llx->%llx %x->%x\n",
shadow->ino, shadow->bix, shadow->gc_level,
shadow->old_ofs, shadow->new_ofs,
shadow->old_len, shadow->new_len);
mempool_free(shadow, super->s_shadow_pool);
}
static void account_shadows(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
struct inode *inode = super->s_master_inode;
struct logfs_inode *li = logfs_inode(inode);
struct shadow_tree *tree = &super->s_shadow_tree;
btree_grim_visitor64(&tree->new, (unsigned long)sb, account_shadow);
btree_grim_visitor64(&tree->old, (unsigned long)sb, account_shadow);
if (li->li_block) {
/*
* We never actually use the structure, when attached to the
* master inode. But it is easier to always free it here than
* to have checks in several places elsewhere when allocating
* it.
*/
li->li_block->ops->free_block(sb, li->li_block);
}
BUG_ON((s64)li->li_used_bytes < 0);
}
static void *__logfs_write_anchor(struct super_block *sb, void *_da,
u16 *type, size_t *len)
{
struct logfs_super *super = logfs_super(sb);
struct logfs_je_anchor *da = _da;
struct inode *inode = super->s_master_inode;
struct logfs_inode *li = logfs_inode(inode);
int i;
da->da_height = li->li_height;
da->da_last_ino = cpu_to_be64(super->s_last_ino);
da->da_size = cpu_to_be64(i_size_read(inode));
da->da_used_bytes = cpu_to_be64(li->li_used_bytes);
for (i = 0; i < LOGFS_EMBEDDED_FIELDS; i++)
da->da_data[i] = cpu_to_be64(li->li_data[i]);
*type = JE_ANCHOR;
*len = sizeof(*da);
return da;
}
static void *logfs_write_dynsb(struct super_block *sb, void *_dynsb,
u16 *type, size_t *len)
{
struct logfs_super *super = logfs_super(sb);
struct logfs_je_dynsb *dynsb = _dynsb;
dynsb->ds_gec = cpu_to_be64(super->s_gec);
dynsb->ds_sweeper = cpu_to_be64(super->s_sweeper);
dynsb->ds_victim_ino = cpu_to_be64(super->s_victim_ino);
dynsb->ds_rename_dir = cpu_to_be64(super->s_rename_dir);
dynsb->ds_rename_pos = cpu_to_be64(super->s_rename_pos);
dynsb->ds_used_bytes = cpu_to_be64(super->s_used_bytes);
dynsb->ds_generation = cpu_to_be32(super->s_generation);
*type = JE_DYNSB;
*len = sizeof(*dynsb);
return dynsb;
}
static void write_wbuf(struct super_block *sb, struct logfs_area *area,
void *wbuf)
{
struct logfs_super *super = logfs_super(sb);
struct address_space *mapping = super->s_mapping_inode->i_mapping;
u64 ofs;
pgoff_t index;
int page_ofs;
struct page *page;
ofs = dev_ofs(sb, area->a_segno,
area->a_used_bytes & ~(super->s_writesize - 1));
index = ofs >> PAGE_SHIFT;
page_ofs = ofs & (PAGE_SIZE - 1);
page = find_lock_page(mapping, index);
BUG_ON(!page);
memcpy(wbuf, page_address(page) + page_ofs, super->s_writesize);
unlock_page(page);
}
static void *logfs_write_area(struct super_block *sb, void *_a,
u16 *type, size_t *len)
{
struct logfs_super *super = logfs_super(sb);
struct logfs_area *area = super->s_area[super->s_sum_index];
struct logfs_je_area *a = _a;
a->vim = VIM_DEFAULT;
a->gc_level = super->s_sum_index;
a->used_bytes = cpu_to_be32(area->a_used_bytes);
a->segno = cpu_to_be32(area->a_segno);
if (super->s_writesize > 1)
write_wbuf(sb, area, a + 1);
*type = JE_AREA;
*len = sizeof(*a) + super->s_writesize;
return a;
}
static void *logfs_write_commit(struct super_block *sb, void *h,
u16 *type, size_t *len)
{
struct logfs_super *super = logfs_super(sb);
*type = JE_COMMIT;
*len = super->s_no_je * sizeof(__be64);
return super->s_je_array;
}
static size_t __logfs_write_je(struct super_block *sb, void *buf, u16 type,
size_t len)
{
struct logfs_super *super = logfs_super(sb);
void *header = super->s_compressed_je;
void *data = header + sizeof(struct logfs_journal_header);
ssize_t compr_len, pad_len;
u8 compr = COMPR_ZLIB;
if (len == 0)
return logfs_write_header(super, header, 0, type);
compr_len = logfs_compress(buf, data, len, sb->s_blocksize);
if (compr_len < 0 || type == JE_ANCHOR) {
BUG_ON(len > sb->s_blocksize);
memcpy(data, buf, len);
compr_len = len;
compr = COMPR_NONE;
}
pad_len = ALIGN(compr_len, 16);
memset(data + compr_len, 0, pad_len - compr_len);
return __logfs_write_header(super, header, compr_len, len, type, compr);
}
static s64 logfs_get_free_bytes(struct logfs_area *area, size_t *bytes,
int must_pad)
{
u32 writesize = logfs_super(area->a_sb)->s_writesize;
s32 ofs;
int ret;
ret = logfs_open_area(area, *bytes);
if (ret)
return -EAGAIN;
ofs = area->a_used_bytes;
area->a_used_bytes += *bytes;
if (must_pad) {
area->a_used_bytes = ALIGN(area->a_used_bytes, writesize);
*bytes = area->a_used_bytes - ofs;
}
return dev_ofs(area->a_sb, area->a_segno, ofs);
}
static int logfs_write_je_buf(struct super_block *sb, void *buf, u16 type,
size_t buf_len)
{
struct logfs_super *super = logfs_super(sb);
struct logfs_area *area = super->s_journal_area;
struct logfs_journal_header *jh = super->s_compressed_je;
size_t len;
int must_pad = 0;
s64 ofs;
len = __logfs_write_je(sb, buf, type, buf_len);
if (jh->h_type == cpu_to_be16(JE_COMMIT))
must_pad = 1;
ofs = logfs_get_free_bytes(area, &len, must_pad);
if (ofs < 0)
return ofs;
logfs_buf_write(area, ofs, super->s_compressed_je, len);
super->s_je_array[super->s_no_je++] = cpu_to_be64(ofs);
return 0;
}
static int logfs_write_je(struct super_block *sb,
void* (*write)(struct super_block *sb, void *scratch,
u16 *type, size_t *len))
{
void *buf;
size_t len;
u16 type;
buf = write(sb, logfs_super(sb)->s_je, &type, &len);
return logfs_write_je_buf(sb, buf, type, len);
}
int write_alias_journal(struct super_block *sb, u64 ino, u64 bix,
level_t level, int child_no, __be64 val)
{
struct logfs_super *super = logfs_super(sb);
struct logfs_obj_alias *oa = super->s_je;
int err = 0, fill = super->s_je_fill;
log_aliases("logfs_write_obj_aliases #%x(%llx, %llx, %x, %x) %llx\n",
fill, ino, bix, level, child_no, be64_to_cpu(val));
oa[fill].ino = cpu_to_be64(ino);
oa[fill].bix = cpu_to_be64(bix);
oa[fill].val = val;
oa[fill].level = (__force u8)level;
oa[fill].child_no = cpu_to_be16(child_no);
fill++;
if (fill >= sb->s_blocksize / sizeof(*oa)) {
err = logfs_write_je_buf(sb, oa, JE_OBJ_ALIAS, sb->s_blocksize);
fill = 0;
}
super->s_je_fill = fill;
return err;
}
static int logfs_write_obj_aliases(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
int err;
log_journal("logfs_write_obj_aliases: %d aliases to write\n",
super->s_no_object_aliases);
super->s_je_fill = 0;
err = logfs_write_obj_aliases_pagecache(sb);
if (err)
return err;
if (super->s_je_fill)
err = logfs_write_je_buf(sb, super->s_je, JE_OBJ_ALIAS,
super->s_je_fill
* sizeof(struct logfs_obj_alias));
return err;
}
/*
* Write all journal entries. The goto logic ensures that all journal entries
* are written whenever a new segment is used. It is ugly and potentially a
* bit wasteful, but robustness is more important. With this we can *always*
* erase all journal segments except the one containing the most recent commit.
*/
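/*
 * Rough order of events in logfs_write_anchor() below: sync the open
 * segments, account the shadows, then (possibly repeatedly, via the goto)
 * write one JE_AREA entry per open area, the object aliases, the erase
 * counts, the anchor and the dynamic superblock; finally sync the device,
 * append the JE_COMMIT entry pointing at all of the above and sync again.
 */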
void logfs_write_anchor(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
struct logfs_area *area = super->s_journal_area;
int i, err;
if (!(super->s_flags & LOGFS_SB_FLAG_DIRTY))
return;
super->s_flags &= ~LOGFS_SB_FLAG_DIRTY;
BUG_ON(super->s_flags & LOGFS_SB_FLAG_SHUTDOWN);
mutex_lock(&super->s_journal_mutex);
/* Do this first or suffer corruption */
logfs_sync_segments(sb);
account_shadows(sb);
again:
super->s_no_je = 0;
for_each_area(i) {
if (!super->s_area[i]->a_is_open)
continue;
super->s_sum_index = i;
err = logfs_write_je(sb, logfs_write_area);
if (err)
goto again;
}
err = logfs_write_obj_aliases(sb);
if (err)
goto again;
err = logfs_write_je(sb, logfs_write_erasecount);
if (err)
goto again;
err = logfs_write_je(sb, __logfs_write_anchor);
if (err)
goto again;
err = logfs_write_je(sb, logfs_write_dynsb);
if (err)
goto again;
/*
* Order is imperative. First we sync all writes, including the
* non-committed journal writes. Then we write the final commit and
* sync the current journal segment.
* There is a theoretical bug here. Syncing the journal segment will
* write a number of journal entries and the final commit. All these
* are written in a single operation. If the device layer writes the
* data back-to-front, the commit will precede the other journal
* entries, leaving a race window.
* Two fixes are possible. Preferred is to fix the device layer to
* ensure writes happen front-to-back. Alternatively we can insert
* another logfs_sync_area() super->s_devops->sync() combo before
* writing the commit.
*/
/*
* On another subject, super->s_devops->sync is usually not necessary.
* Unless called from sys_sync or friends, a barrier would suffice.
*/
super->s_devops->sync(sb);
err = logfs_write_je(sb, logfs_write_commit);
if (err)
goto again;
log_journal("Write commit to %llx\n",
be64_to_cpu(super->s_je_array[super->s_no_je - 1]));
logfs_sync_area(area);
BUG_ON(area->a_used_bytes != area->a_written_bytes);
super->s_devops->sync(sb);
mutex_unlock(&super->s_journal_mutex);
return;
}
void do_logfs_journal_wl_pass(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
struct logfs_area *area = super->s_journal_area;
u32 segno, ec;
int i, err;
log_journal("Journal requires wear-leveling.\n");
/* Drop old segments */
journal_for_each(i)
if (super->s_journal_seg[i]) {
logfs_set_segment_unreserved(sb,
super->s_journal_seg[i],
super->s_journal_ec[i]);
super->s_journal_seg[i] = 0;
super->s_journal_ec[i] = 0;
}
/* Get new segments */
for (i = 0; i < super->s_no_journal_segs; i++) {
segno = get_best_cand(sb, &super->s_reserve_list, &ec);
super->s_journal_seg[i] = segno;
super->s_journal_ec[i] = ec;
logfs_set_segment_reserved(sb, segno);
}
/* Manually move journal_area */
area->a_segno = super->s_journal_seg[0];
area->a_is_open = 0;
area->a_used_bytes = 0;
/* Write journal */
logfs_write_anchor(sb);
/* Write superblocks */
err = logfs_write_sb(sb);
BUG_ON(err);
}
static const struct logfs_area_ops journal_area_ops = {
.get_free_segment = journal_get_free_segment,
.get_erase_count = journal_get_erase_count,
.erase_segment = journal_erase_segment,
};
int logfs_init_journal(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
size_t bufsize = max_t(size_t, sb->s_blocksize, super->s_writesize)
+ MAX_JOURNAL_HEADER;
int ret = -ENOMEM;
mutex_init(&super->s_journal_mutex);
btree_init_mempool32(&super->s_reserved_segments, super->s_btree_pool);
super->s_je = kzalloc(bufsize, GFP_KERNEL);
if (!super->s_je)
return ret;
super->s_compressed_je = kzalloc(bufsize, GFP_KERNEL);
if (!super->s_compressed_je)
return ret;
super->s_master_inode = logfs_new_meta_inode(sb, LOGFS_INO_MASTER);
if (IS_ERR(super->s_master_inode))
return PTR_ERR(super->s_master_inode);
ret = logfs_read_journal(sb);
if (ret)
return -EIO;
reserve_sb_and_journal(sb);
logfs_calc_free(sb);
super->s_journal_area->a_ops = &journal_area_ops;
return 0;
}
void logfs_cleanup_journal(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
btree_grim_visitor32(&super->s_reserved_segments, 0, NULL);
destroy_meta_inode(super->s_master_inode);
super->s_master_inode = NULL;
kfree(super->s_compressed_je);
kfree(super->s_je);
}

fs/logfs/logfs.h Normal file

@ -0,0 +1,724 @@
/*
* fs/logfs/logfs.h
*
* As should be obvious for Linux kernel code, license is GPLv2
*
* Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
*
* Private header for logfs.
*/
#ifndef FS_LOGFS_LOGFS_H
#define FS_LOGFS_LOGFS_H
#undef __CHECK_ENDIAN__
#define __CHECK_ENDIAN__
#include <linux/btree.h>
#include <linux/crc32.h>
#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/mempool.h>
#include <linux/pagemap.h>
#include <linux/mtd/mtd.h>
#include "logfs_abi.h"
#define LOGFS_DEBUG_SUPER (0x0001)
#define LOGFS_DEBUG_SEGMENT (0x0002)
#define LOGFS_DEBUG_JOURNAL (0x0004)
#define LOGFS_DEBUG_DIR (0x0008)
#define LOGFS_DEBUG_FILE (0x0010)
#define LOGFS_DEBUG_INODE (0x0020)
#define LOGFS_DEBUG_READWRITE (0x0040)
#define LOGFS_DEBUG_GC (0x0080)
#define LOGFS_DEBUG_GC_NOISY (0x0100)
#define LOGFS_DEBUG_ALIASES (0x0200)
#define LOGFS_DEBUG_BLOCKMOVE (0x0400)
#define LOGFS_DEBUG_ALL (0xffffffff)
#define LOGFS_DEBUG (0x01)
/*
* To enable specific log messages, simply define LOGFS_DEBUG to match any
* or all of the above.
*/
#ifndef LOGFS_DEBUG
#define LOGFS_DEBUG (0)
#endif
#define log_cond(cond, fmt, arg...) do { \
if (cond) \
printk(KERN_DEBUG fmt, ##arg); \
} while (0)
#define log_super(fmt, arg...) \
log_cond(LOGFS_DEBUG & LOGFS_DEBUG_SUPER, fmt, ##arg)
#define log_segment(fmt, arg...) \
log_cond(LOGFS_DEBUG & LOGFS_DEBUG_SEGMENT, fmt, ##arg)
#define log_journal(fmt, arg...) \
log_cond(LOGFS_DEBUG & LOGFS_DEBUG_JOURNAL, fmt, ##arg)
#define log_dir(fmt, arg...) \
log_cond(LOGFS_DEBUG & LOGFS_DEBUG_DIR, fmt, ##arg)
#define log_file(fmt, arg...) \
log_cond(LOGFS_DEBUG & LOGFS_DEBUG_FILE, fmt, ##arg)
#define log_inode(fmt, arg...) \
log_cond(LOGFS_DEBUG & LOGFS_DEBUG_INODE, fmt, ##arg)
#define log_readwrite(fmt, arg...) \
log_cond(LOGFS_DEBUG & LOGFS_DEBUG_READWRITE, fmt, ##arg)
#define log_gc(fmt, arg...) \
log_cond(LOGFS_DEBUG & LOGFS_DEBUG_GC, fmt, ##arg)
#define log_gc_noisy(fmt, arg...) \
log_cond(LOGFS_DEBUG & LOGFS_DEBUG_GC_NOISY, fmt, ##arg)
#define log_aliases(fmt, arg...) \
log_cond(LOGFS_DEBUG & LOGFS_DEBUG_ALIASES, fmt, ##arg)
#define log_blockmove(fmt, arg...) \
log_cond(LOGFS_DEBUG & LOGFS_DEBUG_BLOCKMOVE, fmt, ##arg)
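/*
 * Hypothetical example of the above: to get GC and journal messages only,
 * one would build with
 *
 * #define LOGFS_DEBUG (LOGFS_DEBUG_GC | LOGFS_DEBUG_JOURNAL)
 *
 * after which log_gc() and log_journal() print via printk(KERN_DEBUG ...)
 * while all other log_* calls are compiled out.
 */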
#define PG_pre_locked PG_owner_priv_1
#define PagePreLocked(page) test_bit(PG_pre_locked, &(page)->flags)
#define SetPagePreLocked(page) set_bit(PG_pre_locked, &(page)->flags)
#define ClearPagePreLocked(page) clear_bit(PG_pre_locked, &(page)->flags)
/* FIXME: This should really be somewhere in the 64bit area. */
#define LOGFS_LINK_MAX (1<<30)
/* Read-only filesystem */
#define LOGFS_SB_FLAG_RO 0x0001
#define LOGFS_SB_FLAG_DIRTY 0x0002
#define LOGFS_SB_FLAG_OBJ_ALIAS 0x0004
#define LOGFS_SB_FLAG_SHUTDOWN 0x0008
/* Write Control Flags */
#define WF_LOCK 0x01 /* take write lock */
#define WF_WRITE 0x02 /* write block */
#define WF_DELETE 0x04 /* delete old block */
typedef u8 __bitwise level_t;
typedef u8 __bitwise gc_level_t;
#define LEVEL(level) ((__force level_t)(level))
#define GC_LEVEL(gc_level) ((__force gc_level_t)(gc_level))
#define SUBLEVEL(level) ( (void)((level) == LEVEL(1)), \
(__force level_t)((__force u8)(level) - 1) )
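/*
 * level_t and gc_level_t are distinct __bitwise types so that sparse can
 * flag code that confuses per-file levels with GC levels.  LEVEL() and
 * GC_LEVEL() are the sanctioned ways to create them from plain integers,
 * and the throwaway comparison inside SUBLEVEL() merely makes sparse
 * verify that its argument really is a level_t before decrementing it.
 */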
/**
* struct logfs_area - area management information
*
* @a_sb: the superblock this area belongs to
* @a_is_open: 1 if the area is currently open, else 0
* @a_segno: segment number of area
* @a_written_bytes: number of bytes already written back
* @a_used_bytes: number of used bytes
* @a_ops: area operations (either journal or ostore)
* @a_erase_count: erase count
* @a_level: GC level
*/
struct logfs_area { /* a segment open for writing */
struct super_block *a_sb;
int a_is_open;
u32 a_segno;
u32 a_written_bytes;
u32 a_used_bytes;
const struct logfs_area_ops *a_ops;
u32 a_erase_count;
gc_level_t a_level;
};
/**
* struct logfs_area_ops - area operations
*
* @get_free_segment: fill area->a_segno with the number of a free segment
* @get_erase_count: fill area->a_erase_count (needs area->a_segno)
* @erase_segment: erase and setup segment
*/
struct logfs_area_ops {
void (*get_free_segment)(struct logfs_area *area);
void (*get_erase_count)(struct logfs_area *area);
int (*erase_segment)(struct logfs_area *area);
};
/**
* struct logfs_device_ops - device access operations
*
* @readpage: read one page (mm page)
* @writeseg: write one segment. may be a partial segment
* @erase: erase one segment
* @sync: block until all pending writes have reached the device
* @put_device: release the underlying device
*/
struct logfs_device_ops {
struct page *(*find_first_sb)(struct super_block *sb, u64 *ofs);
struct page *(*find_last_sb)(struct super_block *sb, u64 *ofs);
int (*write_sb)(struct super_block *sb, struct page *page);
int (*readpage)(void *_sb, struct page *page);
void (*writeseg)(struct super_block *sb, u64 ofs, size_t len);
int (*erase)(struct super_block *sb, loff_t ofs, size_t len,
int ensure_write);
void (*sync)(struct super_block *sb);
void (*put_device)(struct super_block *sb);
};
/**
* struct candidate_list - list of similar candidates
*/
struct candidate_list {
struct rb_root rb_tree;
int count;
int maxcount;
int sort_by_ec;
};
/**
* struct gc_candidate - "candidate" segment to be garbage collected next
*
 * @list: list (either free or low)
* @segno: segment number
* @valid: number of valid bytes
* @erase_count: erase count of segment
* @dist: distance from tree root
*
* Candidates can be on two lists. The free list contains electees rather
* than candidates - segments that no longer contain any valid data. The
* low list contains candidates to be picked for GC. It should be kept
* short. It is not required to always pick a perfect candidate. In the
* worst case GC will have to move more data than absolutely necessary.
*/
struct gc_candidate {
struct rb_node rb_node;
struct candidate_list *list;
u32 segno;
u32 valid;
u32 erase_count;
u8 dist;
};
/**
* struct logfs_journal_entry - temporary structure used during journal scan
*
 * @used:	nonzero if the entry is in use
 * @version:	normalized version
 * @len:	length of the compressed journal entry
 * @datalen:	length of the uncompressed data
 * @offset:	offset of the entry on the medium
*/
struct logfs_journal_entry {
int used;
s16 version;
u16 len;
u16 datalen;
u64 offset;
};
enum transaction_state {
CREATE_1 = 1,
CREATE_2,
UNLINK_1,
UNLINK_2,
CROSS_RENAME_1,
CROSS_RENAME_2,
TARGET_RENAME_1,
TARGET_RENAME_2,
TARGET_RENAME_3
};
/**
* struct logfs_transaction - essential fields to support atomic dirops
*
* @ino: target inode
* @dir: inode of directory containing dentry
* @pos: pos of dentry in directory
*/
struct logfs_transaction {
enum transaction_state state;
u64 ino;
u64 dir;
u64 pos;
};
/**
* struct logfs_shadow - old block in the shadow of a not-yet-committed new one
* @old_ofs: offset of old block on medium
* @new_ofs: offset of new block on medium
* @ino: inode number
* @bix: block index
* @old_len: size of old block, including header
* @new_len: size of new block, including header
 * @gc_level: GC level
*/
struct logfs_shadow {
u64 old_ofs;
u64 new_ofs;
u64 ino;
u64 bix;
int old_len;
int new_len;
gc_level_t gc_level;
};
/**
* struct shadow_tree
* @new: shadows where old_ofs==0, indexed by new_ofs
* @old: shadows where old_ofs!=0, indexed by old_ofs
*/
struct shadow_tree {
struct btree_head64 new;
struct btree_head64 old;
};
struct object_alias_item {
struct list_head list;
__be64 val;
int child_no;
};
/**
* struct logfs_block - contains any block state
 * @full: number of fully populated children
 * @partial: number of partially populated children
*
* Most blocks are directly represented by page cache pages. But when a block
* becomes dirty, is part of a transaction, contains aliases or is otherwise
* special, a struct logfs_block is allocated to track the additional state.
* Inodes are very similar to indirect blocks, so they can also get one of
* these structures added when appropriate.
*/
#define BLOCK_INDIRECT 1 /* Indirect block */
#define BLOCK_INODE 2 /* Inode */
struct logfs_block_ops;
struct logfs_block {
struct list_head alias_list;
struct list_head item_list;
struct super_block *sb;
u64 ino;
u64 bix;
level_t level;
struct page *page;
struct inode *inode;
struct logfs_transaction *ta;
unsigned long alias_map[LOGFS_BLOCK_FACTOR / BITS_PER_LONG];
struct logfs_block_ops *ops;
int full;
int partial;
int reserved_bytes;
};
typedef int write_alias_t(struct super_block *sb, u64 ino, u64 bix,
level_t level, int child_no, __be64 val);
struct logfs_block_ops {
void (*write_block)(struct logfs_block *block);
gc_level_t (*block_level)(struct logfs_block *block);
void (*free_block)(struct super_block *sb, struct logfs_block*block);
int (*write_alias)(struct super_block *sb,
struct logfs_block *block,
write_alias_t *write_one_alias);
};
struct logfs_super {
struct mtd_info *s_mtd; /* underlying device */
struct block_device *s_bdev; /* underlying device */
const struct logfs_device_ops *s_devops;/* device access */
struct inode *s_master_inode; /* inode file */
struct inode *s_segfile_inode; /* segment file */
struct inode *s_mapping_inode; /* device mapping */
atomic_t s_pending_writes; /* outstanding bios */
long s_flags;
mempool_t *s_btree_pool; /* for btree nodes */
mempool_t *s_alias_pool; /* aliases in segment.c */
u64 s_feature_incompat;
u64 s_feature_ro_compat;
u64 s_feature_compat;
u64 s_feature_flags;
u64 s_sb_ofs[2];
struct page *s_erase_page; /* for dev_bdev.c */
/* alias.c fields */
struct btree_head32 s_segment_alias; /* remapped segments */
int s_no_object_aliases;
struct list_head s_object_alias; /* remapped objects */
struct btree_head128 s_object_alias_tree; /* remapped objects */
struct mutex s_object_alias_mutex;
/* dir.c fields */
struct mutex s_dirop_mutex; /* for creat/unlink/rename */
u64 s_victim_ino; /* used for atomic dir-ops */
u64 s_rename_dir; /* source directory ino */
u64 s_rename_pos; /* position of source dd */
/* gc.c fields */
long s_segsize; /* size of a segment */
int s_segshift; /* log2 of segment size */
long s_segmask; /* (1 << s_segshift) - 1 */
long s_no_segs; /* segments on device */
long s_no_journal_segs; /* segments used for journal */
long s_no_blocks; /* blocks per segment */
long s_writesize; /* minimum write size */
int s_writeshift; /* log2 of write size */
u64 s_size; /* filesystem size */
struct logfs_area *s_area[LOGFS_NO_AREAS]; /* open segment array */
u64 s_gec; /* global erase count */
u64 s_wl_gec_ostore; /* time of last wl event */
u64 s_wl_gec_journal; /* time of last wl event */
u64 s_sweeper; /* current sweeper pos */
u8 s_ifile_levels; /* max level of ifile */
u8 s_iblock_levels; /* max level of regular files */
u8 s_data_levels; /* # of segments to leaf block*/
u8 s_total_levels; /* sum of above three */
struct btree_head32 s_cand_tree; /* all candidates */
struct candidate_list s_free_list; /* 100% free segments */
struct candidate_list s_reserve_list; /* Bad segment reserve */
struct candidate_list s_low_list[LOGFS_NO_AREAS];/* good candidates */
struct candidate_list s_ec_list; /* wear level candidates */
struct btree_head32 s_reserved_segments;/* sb, journal, bad, etc. */
/* inode.c fields */
u64 s_last_ino; /* highest ino used */
long s_inos_till_wrap;
u32 s_generation; /* i_generation for new files */
struct list_head s_freeing_list; /* inodes being freed */
/* journal.c fields */
struct mutex s_journal_mutex;
void *s_je; /* journal entry to compress */
void *s_compressed_je; /* block to write to journal */
u32 s_journal_seg[LOGFS_JOURNAL_SEGS]; /* journal segments */
u32 s_journal_ec[LOGFS_JOURNAL_SEGS]; /* journal erasecounts */
u64 s_last_version;
struct logfs_area *s_journal_area; /* open journal segment */
__be64 s_je_array[64];
int s_no_je;
int s_sum_index; /* for the 12 summaries */
struct shadow_tree s_shadow_tree;
int s_je_fill; /* index of current je */
/* readwrite.c fields */
struct mutex s_write_mutex;
int s_lock_count;
mempool_t *s_block_pool; /* struct logfs_block pool */
mempool_t *s_shadow_pool; /* struct logfs_shadow pool */
/*
* Space accounting:
* - s_used_bytes specifies space used to store valid data objects.
* - s_dirty_used_bytes is space used to store non-committed data
* objects. Those objects have already been written themselves,
* but they don't become valid until all indirect blocks up to the
* journal have been written as well.
* - s_dirty_free_bytes is space used to store the old copy of a
* replaced object, as long as the replacement is non-committed.
* In other words, it is the amount of space freed when all dirty
* blocks are written back.
* - s_free_bytes is the amount of free space available for any
* purpose.
* - s_root_reserve is the amount of free space available only to
* the root user. Non-privileged users can no longer write once
* this watermark has been reached.
* - s_speed_reserve is space which remains unused to speed up
* garbage collection performance.
* - s_dirty_pages is the space reserved for currently dirty pages.
* It is a pessimistic estimate, so some/most will get freed on
* page writeback.
*
* s_used_bytes + s_free_bytes + s_speed_reserve = total usable size
*/
u64 s_free_bytes;
u64 s_used_bytes;
u64 s_dirty_free_bytes;
u64 s_dirty_used_bytes;
u64 s_root_reserve;
u64 s_speed_reserve;
u64 s_dirty_pages;
/* Bad block handling:
* - s_bad_seg_reserve is a number of segments usually kept
* free. When encountering bad blocks, the affected segment's data
* is _temporarily_ moved to a reserved segment.
* - s_bad_segments is the number of known bad segments.
*/
u32 s_bad_seg_reserve;
u32 s_bad_segments;
};
/**
* struct logfs_inode - in-memory inode
*
* @vfs_inode: struct inode
* @li_data: data pointers
* @li_used_bytes: number of used bytes
* @li_freeing_list: used to track inodes currently being freed
* @li_flags: inode flags
* @li_refcount: number of internal (GC-induced) references
*/
struct logfs_inode {
struct inode vfs_inode;
u64 li_data[LOGFS_EMBEDDED_FIELDS];
u64 li_used_bytes;
struct list_head li_freeing_list;
struct logfs_block *li_block;
u32 li_flags;
u8 li_height;
int li_refcount;
};
#define journal_for_each(__i) for (__i = 0; __i < LOGFS_JOURNAL_SEGS; __i++)
#define for_each_area(__i) for (__i = 0; __i < LOGFS_NO_AREAS; __i++)
#define for_each_area_down(__i) for (__i = LOGFS_NO_AREAS - 1; __i >= 0; __i--)
/* compr.c */
int logfs_compress(void *in, void *out, size_t inlen, size_t outlen);
int logfs_uncompress(void *in, void *out, size_t inlen, size_t outlen);
int __init logfs_compr_init(void);
void logfs_compr_exit(void);
/* dev_bdev.c */
#ifdef CONFIG_BLOCK
int logfs_get_sb_bdev(struct file_system_type *type, int flags,
const char *devname, struct vfsmount *mnt);
#else
static inline int logfs_get_sb_bdev(struct file_system_type *type, int flags,
const char *devname, struct vfsmount *mnt)
{
return -ENODEV;
}
#endif
/* dev_mtd.c */
#ifdef CONFIG_MTD
int logfs_get_sb_mtd(struct file_system_type *type, int flags,
int mtdnr, struct vfsmount *mnt);
#else
static inline int logfs_get_sb_mtd(struct file_system_type *type, int flags,
int mtdnr, struct vfsmount *mnt)
{
return -ENODEV;
}
#endif
/* dir.c */
extern const struct inode_operations logfs_symlink_iops;
extern const struct inode_operations logfs_dir_iops;
extern const struct file_operations logfs_dir_fops;
int logfs_replay_journal(struct super_block *sb);
/* file.c */
extern const struct inode_operations logfs_reg_iops;
extern const struct file_operations logfs_reg_fops;
extern const struct address_space_operations logfs_reg_aops;
int logfs_readpage(struct file *file, struct page *page);
int logfs_ioctl(struct inode *inode, struct file *file, unsigned int cmd,
unsigned long arg);
int logfs_fsync(struct file *file, struct dentry *dentry, int datasync);
/* gc.c */
u32 get_best_cand(struct super_block *sb, struct candidate_list *list, u32 *ec);
void logfs_gc_pass(struct super_block *sb);
int logfs_check_areas(struct super_block *sb);
int logfs_init_gc(struct super_block *sb);
void logfs_cleanup_gc(struct super_block *sb);
/* inode.c */
extern const struct super_operations logfs_super_operations;
struct inode *logfs_iget(struct super_block *sb, ino_t ino);
struct inode *logfs_safe_iget(struct super_block *sb, ino_t ino, int *cookie);
void logfs_safe_iput(struct inode *inode, int cookie);
struct inode *logfs_new_inode(struct inode *dir, int mode);
struct inode *logfs_new_meta_inode(struct super_block *sb, u64 ino);
struct inode *logfs_read_meta_inode(struct super_block *sb, u64 ino);
int logfs_init_inode_cache(void);
void logfs_destroy_inode_cache(void);
void destroy_meta_inode(struct inode *inode);
void logfs_set_blocks(struct inode *inode, u64 no);
/* these logically belong into inode.c but actually reside in readwrite.c */
int logfs_read_inode(struct inode *inode);
int __logfs_write_inode(struct inode *inode, long flags);
void logfs_delete_inode(struct inode *inode);
void logfs_clear_inode(struct inode *inode);
/* journal.c */
void logfs_write_anchor(struct super_block *sb);
int logfs_init_journal(struct super_block *sb);
void logfs_cleanup_journal(struct super_block *sb);
int write_alias_journal(struct super_block *sb, u64 ino, u64 bix,
level_t level, int child_no, __be64 val);
void do_logfs_journal_wl_pass(struct super_block *sb);
/* readwrite.c */
pgoff_t logfs_pack_index(u64 bix, level_t level);
void logfs_unpack_index(pgoff_t index, u64 *bix, level_t *level);
int logfs_inode_write(struct inode *inode, const void *buf, size_t count,
loff_t bix, long flags, struct shadow_tree *shadow_tree);
int logfs_readpage_nolock(struct page *page);
int logfs_write_buf(struct inode *inode, struct page *page, long flags);
int logfs_delete(struct inode *inode, pgoff_t index,
struct shadow_tree *shadow_tree);
int logfs_rewrite_block(struct inode *inode, u64 bix, u64 ofs,
gc_level_t gc_level, long flags);
int logfs_is_valid_block(struct super_block *sb, u64 ofs, u64 ino, u64 bix,
gc_level_t gc_level);
int logfs_truncate(struct inode *inode, u64 size);
u64 logfs_seek_hole(struct inode *inode, u64 bix);
u64 logfs_seek_data(struct inode *inode, u64 bix);
int logfs_open_segfile(struct super_block *sb);
int logfs_init_rw(struct super_block *sb);
void logfs_cleanup_rw(struct super_block *sb);
void logfs_add_transaction(struct inode *inode, struct logfs_transaction *ta);
void logfs_del_transaction(struct inode *inode, struct logfs_transaction *ta);
void logfs_write_block(struct logfs_block *block, long flags);
int logfs_write_obj_aliases_pagecache(struct super_block *sb);
void logfs_get_segment_entry(struct super_block *sb, u32 segno,
struct logfs_segment_entry *se);
void logfs_set_segment_used(struct super_block *sb, u64 ofs, int increment);
void logfs_set_segment_erased(struct super_block *sb, u32 segno, u32 ec,
gc_level_t gc_level);
void logfs_set_segment_reserved(struct super_block *sb, u32 segno);
void logfs_set_segment_unreserved(struct super_block *sb, u32 segno, u32 ec);
struct logfs_block *__alloc_block(struct super_block *sb,
u64 ino, u64 bix, level_t level);
void __free_block(struct super_block *sb, struct logfs_block *block);
void btree_write_block(struct logfs_block *block);
void initialize_block_counters(struct page *page, struct logfs_block *block,
__be64 *array, int page_is_empty);
int logfs_exist_block(struct inode *inode, u64 bix);
int get_page_reserve(struct inode *inode, struct page *page);
extern struct logfs_block_ops indirect_block_ops;
/* segment.c */
int logfs_erase_segment(struct super_block *sb, u32 ofs, int ensure_erase);
int wbuf_read(struct super_block *sb, u64 ofs, size_t len, void *buf);
int logfs_segment_read(struct inode *inode, struct page *page, u64 ofs, u64 bix,
level_t level);
int logfs_segment_write(struct inode *inode, struct page *page,
struct logfs_shadow *shadow);
int logfs_segment_delete(struct inode *inode, struct logfs_shadow *shadow);
int logfs_load_object_aliases(struct super_block *sb,
struct logfs_obj_alias *oa, int count);
void move_page_to_btree(struct page *page);
int logfs_init_mapping(struct super_block *sb);
void logfs_sync_area(struct logfs_area *area);
void logfs_sync_segments(struct super_block *sb);
/* area handling */
int logfs_init_areas(struct super_block *sb);
void logfs_cleanup_areas(struct super_block *sb);
int logfs_open_area(struct logfs_area *area, size_t bytes);
void __logfs_buf_write(struct logfs_area *area, u64 ofs, void *buf, size_t len,
int use_filler);
static inline void logfs_buf_write(struct logfs_area *area, u64 ofs,
void *buf, size_t len)
{
__logfs_buf_write(area, ofs, buf, len, 0);
}
static inline void logfs_buf_recover(struct logfs_area *area, u64 ofs,
void *buf, size_t len)
{
__logfs_buf_write(area, ofs, buf, len, 1);
}
/* super.c */
struct page *emergency_read_begin(struct address_space *mapping, pgoff_t index);
void emergency_read_end(struct page *page);
void logfs_crash_dump(struct super_block *sb);
void *memchr_inv(const void *s, int c, size_t n);
int logfs_statfs(struct dentry *dentry, struct kstatfs *stats);
int logfs_get_sb_device(struct file_system_type *type, int flags,
struct mtd_info *mtd, struct block_device *bdev,
const struct logfs_device_ops *devops, struct vfsmount *mnt);
int logfs_check_ds(struct logfs_disk_super *ds);
int logfs_write_sb(struct super_block *sb);
static inline struct logfs_super *logfs_super(struct super_block *sb)
{
return sb->s_fs_info;
}
static inline struct logfs_inode *logfs_inode(struct inode *inode)
{
return container_of(inode, struct logfs_inode, vfs_inode);
}
static inline void logfs_set_ro(struct super_block *sb)
{
logfs_super(sb)->s_flags |= LOGFS_SB_FLAG_RO;
}
#define LOGFS_BUG(sb) do { \
struct super_block *__sb = sb; \
logfs_crash_dump(__sb); \
logfs_super(__sb)->s_flags |= LOGFS_SB_FLAG_RO; \
BUG(); \
} while (0)
#define LOGFS_BUG_ON(condition, sb) \
do { if (unlikely(condition)) LOGFS_BUG((sb)); } while (0)
static inline __be32 logfs_crc32(void *data, size_t len, size_t skip)
{
return cpu_to_be32(crc32(~0, data+skip, len-skip));
}
static inline u8 logfs_type(struct inode *inode)
{
return (inode->i_mode >> 12) & 15;
}
static inline pgoff_t logfs_index(struct super_block *sb, u64 pos)
{
return pos >> sb->s_blocksize_bits;
}
static inline u64 dev_ofs(struct super_block *sb, u32 segno, u32 ofs)
{
return ((u64)segno << logfs_super(sb)->s_segshift) + ofs;
}
static inline u32 seg_no(struct super_block *sb, u64 ofs)
{
return ofs >> logfs_super(sb)->s_segshift;
}
static inline u32 seg_ofs(struct super_block *sb, u64 ofs)
{
return ofs & logfs_super(sb)->s_segmask;
}
static inline u64 seg_align(struct super_block *sb, u64 ofs)
{
return ofs & ~logfs_super(sb)->s_segmask;
}
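/*
 * Worked example, assuming 128KiB segments (s_segshift == 17, s_segmask ==
 * 0x1ffff); the real values come from the superblock at mount time:
 *
 *	dev_ofs(sb, 3, 0x100)  == (3 << 17) + 0x100 == 0x60100
 *	seg_no(sb, 0x60100)    == 3
 *	seg_ofs(sb, 0x60100)   == 0x100
 *	seg_align(sb, 0x60100) == 0x60000
 */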
static inline struct logfs_block *logfs_block(struct page *page)
{
return (void *)page->private;
}
static inline level_t shrink_level(gc_level_t __level)
{
u8 level = (__force u8)__level;
if (level >= LOGFS_MAX_LEVELS)
level -= LOGFS_MAX_LEVELS;
return (__force level_t)level;
}
static inline gc_level_t expand_level(u64 ino, level_t __level)
{
u8 level = (__force u8)__level;
if (ino == LOGFS_INO_MASTER) {
/* ifile has separate areas */
level += LOGFS_MAX_LEVELS;
}
return (__force gc_level_t)level;
}
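/*
 * Example mapping between the two level spaces:
 *
 *	regular file, data block:	LEVEL(0) <-> GC_LEVEL(0)
 *	regular file, i2 block:		LEVEL(2) <-> GC_LEVEL(2)
 *	ifile, data block (inodes):	LEVEL(0) <-> GC_LEVEL(6)
 *	ifile, i2 block:		LEVEL(2) <-> GC_LEVEL(8)
 *
 * expand_level() adds LOGFS_MAX_LEVELS (6) for the master inode and
 * shrink_level() removes it again.
 */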
static inline int logfs_block_shift(struct super_block *sb, level_t level)
{
level = shrink_level((__force gc_level_t)level);
return (__force int)level * (sb->s_blocksize_bits - 3);
}
static inline u64 logfs_block_mask(struct super_block *sb, level_t level)
{
return ~0ull << logfs_block_shift(sb, level);
}
static inline struct logfs_area *get_area(struct super_block *sb,
gc_level_t gc_level)
{
return logfs_super(sb)->s_area[(__force u8)gc_level];
}
#endif

629
fs/logfs/logfs_abi.h Normal file

@ -0,0 +1,629 @@
/*
* fs/logfs/logfs_abi.h
*
* As should be obvious for Linux kernel code, license is GPLv2
*
* Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
*
* Public header for logfs.
*/
#ifndef FS_LOGFS_LOGFS_ABI_H
#define FS_LOGFS_LOGFS_ABI_H
/* For out-of-kernel compiles */
#ifndef BUILD_BUG_ON
#define BUILD_BUG_ON(condition) /**/
#endif
#define SIZE_CHECK(type, size) \
static inline void check_##type(void) \
{ \
BUILD_BUG_ON(sizeof(struct type) != (size)); \
}
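/*
 * For instance, the SIZE_CHECK(logfs_segment_header, LOGFS_SEGMENT_HEADERSIZE)
 * invocation further down expands to roughly:
 *
 *	static inline void check_logfs_segment_header(void)
 *	{
 *		BUILD_BUG_ON(sizeof(struct logfs_segment_header)
 *				!= LOGFS_SEGMENT_HEADERSIZE);
 *	}
 *
 * so any mismatch between a structure and its on-medium size breaks the
 * build instead of corrupting the medium.
 */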
/*
* Throughout the logfs code, we're constantly dealing with blocks at
 * various positions or offsets. To remove confusion, we strictly
* distinguish between a "position" - the logical position within a
* file and an "offset" - the physical location within the device.
*
* Any usage of the term offset for a logical location or position for
* a physical one is a bug and should get fixed.
*/
/*
* Block are allocated in one of several segments depending on their
* level. The following levels are used:
* 0 - regular data block
* 1 - i1 indirect blocks
* 2 - i2 indirect blocks
* 3 - i3 indirect blocks
* 4 - i4 indirect blocks
* 5 - i5 indirect blocks
* 6 - ifile data blocks
* 7 - ifile i1 indirect blocks
* 8 - ifile i2 indirect blocks
* 9 - ifile i3 indirect blocks
* 10 - ifile i4 indirect blocks
* 11 - ifile i5 indirect blocks
* Potential levels to be used in the future:
* 12 - gc recycled blocks, long-lived data
* 13 - replacement blocks, short-lived data
*
 * Levels 1-11 are necessary for robust gc operations and help separate
 * short-lived metadata from longer-lived file data. In the future,
 * file data should get separated into several segments based on simple
* heuristics. Old data recycled during gc operation is expected to be
* long-lived. New data is of uncertain life expectancy. New data
* used to replace older blocks in existing files is expected to be
* short-lived.
*/
/* Magic numbers. 64bit for superblock, 32bit for statfs f_type */
#define LOGFS_MAGIC 0x7a3a8e5cb9d5bf67ull
#define LOGFS_MAGIC_U32 0xc97e8168u
/*
* Various blocksize related macros. Blocksize is currently fixed at 4KiB.
* Sooner or later that should become configurable and the macros replaced
* by something superblock-dependent. Pointers in indirect blocks are and
* will remain 64bit.
*
* LOGFS_BLOCKSIZE - self-explaining
* LOGFS_BLOCK_FACTOR - number of pointers per indirect block
* LOGFS_BLOCK_BITS - log2 of LOGFS_BLOCK_FACTOR, used for shifts
*/
#define LOGFS_BLOCKSIZE (4096ull)
#define LOGFS_BLOCK_FACTOR (LOGFS_BLOCKSIZE / sizeof(u64))
#define LOGFS_BLOCK_BITS (9)
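/*
 * With the fixed 4KiB block size this evaluates to:
 *
 *	LOGFS_BLOCK_FACTOR = 4096 / sizeof(u64) = 512 pointers per indirect block
 *	LOGFS_BLOCK_BITS   = log2(512) = 9
 */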
/*
* Number of blocks at various levels of indirection. There are 16 direct
* block pointers plus a single indirect pointer.
*/
#define I0_BLOCKS (16)
#define I1_BLOCKS LOGFS_BLOCK_FACTOR
#define I2_BLOCKS (LOGFS_BLOCK_FACTOR * I1_BLOCKS)
#define I3_BLOCKS (LOGFS_BLOCK_FACTOR * I2_BLOCKS)
#define I4_BLOCKS (LOGFS_BLOCK_FACTOR * I3_BLOCKS)
#define I5_BLOCKS (LOGFS_BLOCK_FACTOR * I4_BLOCKS)
#define INDIRECT_INDEX I0_BLOCKS
#define LOGFS_EMBEDDED_FIELDS (I0_BLOCKS + 1)
/*
* Sizes at which files require another level of indirection. Files smaller
* than LOGFS_EMBEDDED_SIZE can be completely stored in the inode itself,
 * similar to ext2 fast symlinks.
*
* Data at a position smaller than LOGFS_I0_SIZE is accessed through the
* direct pointers, else through the 1x indirect pointer and so forth.
*/
#define LOGFS_EMBEDDED_SIZE (LOGFS_EMBEDDED_FIELDS * sizeof(u64))
#define LOGFS_I0_SIZE (I0_BLOCKS * LOGFS_BLOCKSIZE)
#define LOGFS_I1_SIZE (I1_BLOCKS * LOGFS_BLOCKSIZE)
#define LOGFS_I2_SIZE (I2_BLOCKS * LOGFS_BLOCKSIZE)
#define LOGFS_I3_SIZE (I3_BLOCKS * LOGFS_BLOCKSIZE)
#define LOGFS_I4_SIZE (I4_BLOCKS * LOGFS_BLOCKSIZE)
#define LOGFS_I5_SIZE (I5_BLOCKS * LOGFS_BLOCKSIZE)
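/*
 * Evaluated for the fixed 4KiB block size:
 *
 *	LOGFS_EMBEDDED_SIZE = 17 * 8 B      = 136 bytes (stored in the inode)
 *	LOGFS_I0_SIZE       = 16 * 4 KiB    =  64 KiB
 *	LOGFS_I1_SIZE       = 512 * 4 KiB   =   2 MiB
 *	LOGFS_I2_SIZE       = 512 * 2 MiB   =   1 GiB
 *	LOGFS_I3_SIZE       = 512 * 1 GiB   = 512 GiB
 *	LOGFS_I4_SIZE       = 512 * 512 GiB = 256 TiB
 *	LOGFS_I5_SIZE       = 512 * 256 TiB = 128 PiB
 */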
/*
* Each indirect block pointer must have this flag set, if all block pointers
* behind it are set, i.e. there is no hole hidden in the shadow of this
* indirect block pointer.
*/
#define LOGFS_FULLY_POPULATED (1ULL << 63)
#define pure_ofs(ofs) (ofs & ~LOGFS_FULLY_POPULATED)
/*
 * LogFS needs to separate data into levels. Each level is defined as the
* maximal possible distance from the master inode (inode of the inode file).
* Data blocks reside on level 0, 1x indirect block on level 1, etc.
* Inodes reside on level 6, indirect blocks for the inode file on levels 7-11.
* This effort is necessary to guarantee garbage collection to always make
* progress.
*
* LOGFS_MAX_INDIRECT is the maximal indirection through indirect blocks,
* LOGFS_MAX_LEVELS is one more for the actual data level of a file. It is
* the maximal number of levels for one file.
* LOGFS_NO_AREAS is twice that, as the inode file and regular files are
* effectively stacked on top of each other.
*/
#define LOGFS_MAX_INDIRECT (5)
#define LOGFS_MAX_LEVELS (LOGFS_MAX_INDIRECT + 1)
#define LOGFS_NO_AREAS (2 * LOGFS_MAX_LEVELS)
/* Maximum size of filenames */
#define LOGFS_MAX_NAMELEN (255)
/* Number of segments in the primary journal. */
#define LOGFS_JOURNAL_SEGS (16)
/* Maximum number of free/erased/etc. segments in journal entries */
#define MAX_CACHED_SEGS (64)
/*
* LOGFS_OBJECT_HEADERSIZE is the size of a single header in the object store,
* LOGFS_MAX_OBJECTSIZE the size of the largest possible object, including
* its header,
* LOGFS_SEGMENT_RESERVE is the amount of space reserved for each segment for
* its segment header and the padded space at the end when no further objects
* fit.
*/
#define LOGFS_OBJECT_HEADERSIZE (0x1c)
#define LOGFS_SEGMENT_HEADERSIZE (0x18)
#define LOGFS_MAX_OBJECTSIZE (LOGFS_OBJECT_HEADERSIZE + LOGFS_BLOCKSIZE)
#define LOGFS_SEGMENT_RESERVE \
(LOGFS_SEGMENT_HEADERSIZE + LOGFS_MAX_OBJECTSIZE - 1)
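/*
 * In numbers: the object header is 28 bytes and the segment header 24 bytes,
 * so the largest possible object is 28 + 4096 = 4124 bytes and every segment
 * reserves 24 + 4124 - 1 = 4147 bytes for its header and tail padding.
 */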
/*
* Segment types:
 * SEG_SUPER	- superblock segment
 * SEG_JOURNAL	- journal segment
 * SEG_OSTORE	- object store segment
*/
enum {
SEG_SUPER = 0x01,
SEG_JOURNAL = 0x02,
SEG_OSTORE = 0x03,
};
/**
* struct logfs_segment_header - per-segment header in the ostore
*
* @crc: crc32 of header (there is no data)
* @pad: unused, must be 0
* @type: segment type, see above
* @level: GC level for all objects in this segment
* @segno: segment number
* @ec: erase count for this segment
* @gec: global erase count at time of writing
*/
struct logfs_segment_header {
__be32 crc;
__be16 pad;
__u8 type;
__u8 level;
__be32 segno;
__be32 ec;
__be64 gec;
};
SIZE_CHECK(logfs_segment_header, LOGFS_SEGMENT_HEADERSIZE);
#define LOGFS_FEATURES_INCOMPAT (0ull)
#define LOGFS_FEATURES_RO_COMPAT (0ull)
#define LOGFS_FEATURES_COMPAT (0ull)
/**
* struct logfs_disk_super - on-medium superblock
*
 * @ds_sh:			segment header of the superblock segment
 * @ds_magic:			magic number, must equal LOGFS_MAGIC
 * @ds_crc:			crc32 of structure starting with the next field
 * @ds_ifile_levels:		maximum number of levels for ifile
 * @ds_iblock_levels:		maximum number of levels for regular files
 * @ds_data_levels:		number of separate levels for data
 * @ds_segment_shift:		log2 of segment size
 * @ds_block_shift:		log2 of block size
 * @ds_write_shift:		log2 of write size
 * @pad0:			reserved, must be 0
 * @ds_filesystem_size:		size of the filesystem
 * @ds_segment_size:		size of a single segment
 * @ds_bad_seg_reserve:		number of segments reserved to handle bad blocks
 * @ds_feature_incompat:	incompatible filesystem features
 * @ds_feature_ro_compat:	read-only compatible filesystem features
 * @ds_feature_compat:		compatible filesystem features
 * @ds_feature_flags:		flags
 * @ds_root_reserve:		bytes reserved for the superuser
 * @ds_speed_reserve:		bytes reserved to speed up GC
 * @ds_journal_seg:		segments used by primary journal
 * @ds_super_ofs:		device offsets of the two superblocks
 * @pad3:			reserved, must be 0
*
 * Contains only read-only fields. Read-write fields like the amount of used
 * space are tracked in the dynamic superblock, which is stored in the journal.
*/
struct logfs_disk_super {
struct logfs_segment_header ds_sh;
__be64 ds_magic;
__be32 ds_crc;
__u8 ds_ifile_levels;
__u8 ds_iblock_levels;
__u8 ds_data_levels;
__u8 ds_segment_shift;
__u8 ds_block_shift;
__u8 ds_write_shift;
__u8 pad0[6];
__be64 ds_filesystem_size;
__be32 ds_segment_size;
__be32 ds_bad_seg_reserve;
__be64 ds_feature_incompat;
__be64 ds_feature_ro_compat;
__be64 ds_feature_compat;
__be64 ds_feature_flags;
__be64 ds_root_reserve;
__be64 ds_speed_reserve;
__be32 ds_journal_seg[LOGFS_JOURNAL_SEGS];
__be64 ds_super_ofs[2];
__be64 pad3[8];
};
SIZE_CHECK(logfs_disk_super, 256);
/*
* Object types:
* OBJ_BLOCK - Data or indirect block
* OBJ_INODE - Inode
* OBJ_DENTRY - Dentry
*/
enum {
OBJ_BLOCK = 0x04,
OBJ_INODE = 0x05,
OBJ_DENTRY = 0x06,
};
/**
* struct logfs_object_header - per-object header in the ostore
*
* @crc: crc32 of header, excluding data_crc
* @len: length of data
* @type: object type, see above
* @compr: compression type
* @ino: inode number
* @bix: block index
* @data_crc: crc32 of payload
*/
struct logfs_object_header {
__be32 crc;
__be16 len;
__u8 type;
__u8 compr;
__be64 ino;
__be64 bix;
__be32 data_crc;
} __attribute__((packed));
SIZE_CHECK(logfs_object_header, LOGFS_OBJECT_HEADERSIZE);
/*
* Reserved inode numbers:
 * LOGFS_INO_MAPPING	- device mapping (page cache for the medium)
 * LOGFS_INO_MASTER	- master inode (for inode file)
 * LOGFS_INO_ROOT	- root directory
 * LOGFS_INO_SEGFILE	- per-segment used bytes and erase count
*/
enum {
LOGFS_INO_MAPPING = 0x00,
LOGFS_INO_MASTER = 0x01,
LOGFS_INO_ROOT = 0x02,
LOGFS_INO_SEGFILE = 0x03,
LOGFS_RESERVED_INOS = 0x10,
};
/*
* Inode flags. High bits should never be written to the medium. They are
* reserved for in-memory usage.
* Low bits should either remain in sync with the corresponding FS_*_FL or
* reuse slots that obviously don't make sense for logfs.
*
* LOGFS_IF_DIRTY Inode must be written back
* LOGFS_IF_ZOMBIE Inode has been deleted
* LOGFS_IF_STILLBORN -ENOSPC happened when creating inode
*/
#define LOGFS_IF_COMPRESSED 0x00000004 /* == FS_COMPR_FL */
#define LOGFS_IF_DIRTY 0x20000000
#define LOGFS_IF_ZOMBIE 0x40000000
#define LOGFS_IF_STILLBORN 0x80000000
/* Flags available to chattr */
#define LOGFS_FL_USER_VISIBLE (LOGFS_IF_COMPRESSED)
#define LOGFS_FL_USER_MODIFIABLE (LOGFS_IF_COMPRESSED)
/* Flags inherited from parent directory on file/directory creation */
#define LOGFS_FL_INHERITED (LOGFS_IF_COMPRESSED)
/**
* struct logfs_disk_inode - on-medium inode
*
 * @di_mode:		file mode
 * @di_height:		height of the file's block tree
 * @di_pad:		reserved, must be 0
 * @di_flags:		inode flags, see above
 * @di_uid:		user id
 * @di_gid:		group id
 * @di_ctime:		change time
 * @di_mtime:		modify time
 * @di_atime:		access time
 * @di_refcount:	reference count (aka nlink or link count)
 * @di_generation:	inode generation, for nfs
 * @di_used_bytes:	number of bytes used
 * @di_size:		file size
 * @di_data:		data pointers
*/
struct logfs_disk_inode {
__be16 di_mode;
__u8 di_height;
__u8 di_pad;
__be32 di_flags;
__be32 di_uid;
__be32 di_gid;
__be64 di_ctime;
__be64 di_mtime;
__be64 di_atime;
__be32 di_refcount;
__be32 di_generation;
__be64 di_used_bytes;
__be64 di_size;
__be64 di_data[LOGFS_EMBEDDED_FIELDS];
};
SIZE_CHECK(logfs_disk_inode, 200);
#define INODE_POINTER_OFS \
(offsetof(struct logfs_disk_inode, di_data) / sizeof(__be64))
#define INODE_USED_OFS \
(offsetof(struct logfs_disk_inode, di_used_bytes) / sizeof(__be64))
#define INODE_SIZE_OFS \
(offsetof(struct logfs_disk_inode, di_size) / sizeof(__be64))
#define INODE_HEIGHT_OFS (0)
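/*
 * The *_OFS constants are offsets in units of __be64 words within struct
 * logfs_disk_inode.  Assuming no implicit padding (the SIZE_CHECK above pins
 * the structure to exactly 200 bytes, so there is none), they evaluate to:
 *
 *	INODE_HEIGHT_OFS  = 0	(di_height lives in the first word)
 *	INODE_USED_OFS    = 6	(di_used_bytes at byte offset 48)
 *	INODE_SIZE_OFS    = 7	(di_size at byte offset 56)
 *	INODE_POINTER_OFS = 8	(di_data[] starts at byte offset 64)
 */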
/**
* struct logfs_disk_dentry - on-medium dentry structure
*
* @ino: inode number
* @namelen: length of file name
* @type: file type, identical to bits 12..15 of mode
* @name: file name
*/
/* FIXME: add 6 bytes of padding to remove the __packed */
struct logfs_disk_dentry {
__be64 ino;
__be16 namelen;
__u8 type;
__u8 name[LOGFS_MAX_NAMELEN];
} __attribute__((packed));
SIZE_CHECK(logfs_disk_dentry, 266);
#define RESERVED 0xffffffff
#define BADSEG 0xffffffff
/**
* struct logfs_segment_entry - segment file entry
*
* @ec_level: erase count and level
* @valid: number of valid bytes
*
* Segment file contains one entry for every segment. ec_level contains the
* erasecount in the upper 28 bits and the level in the lower 4 bits. An
* ec_level of BADSEG (-1) identifies bad segments. valid contains the number
* of valid bytes or RESERVED (-1 again) if the segment is used for either the
* superblock or the journal, or when the segment is bad.
*/
struct logfs_segment_entry {
__be32 ec_level;
__be32 valid;
};
SIZE_CHECK(logfs_segment_entry, 8);
/**
* struct logfs_journal_header - header for journal entries (JEs)
*
* @h_crc: crc32 of journal entry
* @h_len: length of compressed journal entry,
* not including header
* @h_datalen: length of uncompressed data
* @h_type: JE type
* @h_compr: compression type
* @h_pad: reserved
*/
struct logfs_journal_header {
__be32 h_crc;
__be16 h_len;
__be16 h_datalen;
__be16 h_type;
__u8 h_compr;
__u8 h_pad[5];
};
SIZE_CHECK(logfs_journal_header, 16);
/*
 * Life expectancy of data.
* VIM_DEFAULT - default vim
* VIM_SEGFILE - for segment file only - very short-living
* VIM_GC - GC'd data - likely long-living
*/
enum logfs_vim {
VIM_DEFAULT = 0,
VIM_SEGFILE = 1,
};
/**
* struct logfs_je_area - wbuf header
*
* @segno: segment number of area
* @used_bytes: number of bytes already used
* @gc_level: GC level
* @vim: life expectancy of data
*
* "Areas" are segments currently being used for writing. There is at least
 * one area per GC level. Several may be used to separate long-living from
* short-living data. If an area with unknown vim is encountered, it can
* simply be closed.
 * The write buffer immediately follows this header.
*/
struct logfs_je_area {
__be32 segno;
__be32 used_bytes;
__u8 gc_level;
__u8 vim;
} __attribute__((packed));
SIZE_CHECK(logfs_je_area, 10);
#define MAX_JOURNAL_HEADER \
(sizeof(struct logfs_journal_header) + sizeof(struct logfs_je_area))
/**
* struct logfs_je_dynsb - dynamic superblock
*
* @ds_gec: global erase count
* @ds_sweeper: current position of GC "sweeper"
* @ds_rename_dir: source directory ino (see dir.c documentation)
* @ds_rename_pos: position of source dd (see dir.c documentation)
 * @ds_victim_ino:	victim of an incomplete dir operation (see dir.c)
 * @ds_victim_parent:	parent inode of the victim (see dir.c)
 * @ds_used_bytes:	number of used bytes
 * @ds_generation:	current inode generation (i_generation for new inodes)
*/
struct logfs_je_dynsb {
__be64 ds_gec;
__be64 ds_sweeper;
__be64 ds_rename_dir;
__be64 ds_rename_pos;
__be64 ds_victim_ino;
__be64 ds_victim_parent; /* XXX */
__be64 ds_used_bytes;
__be32 ds_generation;
__be32 pad;
};
SIZE_CHECK(logfs_je_dynsb, 64);
/**
* struct logfs_je_anchor - anchor of filesystem tree, aka master inode
*
* @da_size: size of inode file
* @da_last_ino: last created inode
* @da_used_bytes: number of bytes used
 * @da_height:		height of the inode file's block tree
 * @da_data:		data pointers
*/
struct logfs_je_anchor {
__be64 da_size;
__be64 da_last_ino;
__be64 da_used_bytes;
u8 da_height;
u8 pad[7];
__be64 da_data[LOGFS_EMBEDDED_FIELDS];
};
SIZE_CHECK(logfs_je_anchor, 168);
/**
* struct logfs_je_spillout - spillout entry (from 1st to 2nd journal)
*
* @so_segment: segments used for 2nd journal
*
* Length of the array is given by h_len field in the header.
*/
struct logfs_je_spillout {
__be64 so_segment[0];
};
SIZE_CHECK(logfs_je_spillout, 0);
/**
* struct logfs_je_journal_ec - erase counts for all journal segments
*
* @ec: erase count
*
* Length of the array is given by h_len field in the header.
*/
struct logfs_je_journal_ec {
__be32 ec[0];
};
SIZE_CHECK(logfs_je_journal_ec, 0);
/**
 * struct logfs_je_free_segments - list of free segments with erase count
*/
struct logfs_je_free_segments {
__be32 segno;
__be32 ec;
};
SIZE_CHECK(logfs_je_free_segments, 8);
/**
* struct logfs_seg_alias - list of segment aliases
*/
struct logfs_seg_alias {
__be32 old_segno;
__be32 new_segno;
};
SIZE_CHECK(logfs_seg_alias, 8);
/**
* struct logfs_obj_alias - list of object aliases
*/
struct logfs_obj_alias {
__be64 ino;
__be64 bix;
__be64 val;
u8 level;
u8 pad[5];
__be16 child_no;
};
SIZE_CHECK(logfs_obj_alias, 32);
/**
* Compression types.
*
* COMPR_NONE - uncompressed
* COMPR_ZLIB - compressed with zlib
*/
enum {
COMPR_NONE = 0,
COMPR_ZLIB = 1,
};
/*
* Journal entries come in groups of 16. First group contains unique
* entries, next groups contain one entry per level
*
* JE_FIRST - smallest possible journal entry number
*
* JEG_BASE - base group, containing unique entries
* JE_COMMIT - commit entry, validates all previous entries
* JE_DYNSB - dynamic superblock, anything that ought to be in the
* superblock but cannot because it is read-write data
* JE_ANCHOR - anchor aka master inode aka inode file's inode
 * JE_ERASECOUNT	- erase counts for all journal segments
 * JE_SPILLOUT	- unused
 * JE_OBJ_ALIAS	- object aliases
* JE_AREA - area description
*
* JE_LAST - largest possible journal entry number
*/
enum {
JE_FIRST = 0x01,
JEG_BASE = 0x00,
JE_COMMIT = 0x02,
JE_DYNSB = 0x03,
JE_ANCHOR = 0x04,
JE_ERASECOUNT = 0x05,
JE_SPILLOUT = 0x06,
JE_OBJ_ALIAS = 0x0d,
JE_AREA = 0x0e,
JE_LAST = 0x0e,
};
#endif

2246
fs/logfs/readwrite.c Normal file

File diff suppressed because it is too large

927
fs/logfs/segment.c Normal file

@ -0,0 +1,927 @@
/*
* fs/logfs/segment.c - Handling the Object Store
*
* As should be obvious for Linux kernel code, license is GPLv2
*
* Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
*
* Object store or ostore makes up the complete device with exception of
* the superblock and journal areas. Apart from its own metadata it stores
* three kinds of objects: inodes, dentries and blocks, both data and indirect.
*/
#include "logfs.h"
static int logfs_mark_segment_bad(struct super_block *sb, u32 segno)
{
struct logfs_super *super = logfs_super(sb);
struct btree_head32 *head = &super->s_reserved_segments;
int err;
err = btree_insert32(head, segno, (void *)1, GFP_NOFS);
if (err)
return err;
logfs_super(sb)->s_bad_segments++;
/* FIXME: write to journal */
return 0;
}
int logfs_erase_segment(struct super_block *sb, u32 segno, int ensure_erase)
{
struct logfs_super *super = logfs_super(sb);
super->s_gec++;
return super->s_devops->erase(sb, (u64)segno << super->s_segshift,
super->s_segsize, ensure_erase);
}
static s64 logfs_get_free_bytes(struct logfs_area *area, size_t bytes)
{
s32 ofs;
logfs_open_area(area, bytes);
ofs = area->a_used_bytes;
area->a_used_bytes += bytes;
BUG_ON(area->a_used_bytes >= logfs_super(area->a_sb)->s_segsize);
return dev_ofs(area->a_sb, area->a_segno, ofs);
}
static struct page *get_mapping_page(struct super_block *sb, pgoff_t index,
int use_filler)
{
struct logfs_super *super = logfs_super(sb);
struct address_space *mapping = super->s_mapping_inode->i_mapping;
filler_t *filler = super->s_devops->readpage;
struct page *page;
BUG_ON(mapping_gfp_mask(mapping) & __GFP_FS);
if (use_filler)
page = read_cache_page(mapping, index, filler, sb);
else {
page = find_or_create_page(mapping, index, GFP_NOFS);
unlock_page(page);
}
return page;
}
void __logfs_buf_write(struct logfs_area *area, u64 ofs, void *buf, size_t len,
int use_filler)
{
pgoff_t index = ofs >> PAGE_SHIFT;
struct page *page;
long offset = ofs & (PAGE_SIZE-1);
long copylen;
/* Only logfs_wbuf_recover may use len==0 */
BUG_ON(!len && !use_filler);
do {
copylen = min((ulong)len, PAGE_SIZE - offset);
page = get_mapping_page(area->a_sb, index, use_filler);
SetPageUptodate(page);
BUG_ON(!page); /* FIXME: reserve a pool */
memcpy(page_address(page) + offset, buf, copylen);
SetPagePrivate(page);
page_cache_release(page);
buf += copylen;
len -= copylen;
offset = 0;
index++;
} while (len);
}
/*
* bdev_writeseg will write full pages. Memset the tail to prevent data leaks.
*/
static void pad_wbuf(struct logfs_area *area, int final)
{
struct super_block *sb = area->a_sb;
struct logfs_super *super = logfs_super(sb);
struct page *page;
u64 ofs = dev_ofs(sb, area->a_segno, area->a_used_bytes);
pgoff_t index = ofs >> PAGE_SHIFT;
long offset = ofs & (PAGE_SIZE-1);
u32 len = PAGE_SIZE - offset;
if (len == PAGE_SIZE) {
/* The math in this function can surely use some love */
len = 0;
}
if (len) {
BUG_ON(area->a_used_bytes >= super->s_segsize);
page = get_mapping_page(area->a_sb, index, 0);
BUG_ON(!page); /* FIXME: reserve a pool */
memset(page_address(page) + offset, 0xff, len);
SetPagePrivate(page);
page_cache_release(page);
}
if (!final)
return;
area->a_used_bytes += len;
for ( ; area->a_used_bytes < super->s_segsize;
area->a_used_bytes += PAGE_SIZE) {
/* Memset another page */
index++;
page = get_mapping_page(area->a_sb, index, 0);
BUG_ON(!page); /* FIXME: reserve a pool */
memset(page_address(page), 0xff, PAGE_SIZE);
SetPagePrivate(page);
page_cache_release(page);
}
}
/*
* We have to be careful with the alias tree. Since lookup is done by bix,
* it needs to be normalized, so 14, 15, 16, etc. all match when dealing with
* indirect blocks. So always use it through accessor functions.
*/
static void *alias_tree_lookup(struct super_block *sb, u64 ino, u64 bix,
level_t level)
{
struct btree_head128 *head = &logfs_super(sb)->s_object_alias_tree;
pgoff_t index = logfs_pack_index(bix, level);
return btree_lookup128(head, ino, index);
}
static int alias_tree_insert(struct super_block *sb, u64 ino, u64 bix,
level_t level, void *val)
{
struct btree_head128 *head = &logfs_super(sb)->s_object_alias_tree;
pgoff_t index = logfs_pack_index(bix, level);
return btree_insert128(head, ino, index, val, GFP_NOFS);
}
static int btree_write_alias(struct super_block *sb, struct logfs_block *block,
write_alias_t *write_one_alias)
{
struct object_alias_item *item;
int err;
list_for_each_entry(item, &block->item_list, list) {
err = write_alias_journal(sb, block->ino, block->bix,
block->level, item->child_no, item->val);
if (err)
return err;
}
return 0;
}
static gc_level_t btree_block_level(struct logfs_block *block)
{
return expand_level(block->ino, block->level);
}
static struct logfs_block_ops btree_block_ops = {
.write_block = btree_write_block,
.block_level = btree_block_level,
.free_block = __free_block,
.write_alias = btree_write_alias,
};
int logfs_load_object_aliases(struct super_block *sb,
struct logfs_obj_alias *oa, int count)
{
struct logfs_super *super = logfs_super(sb);
struct logfs_block *block;
struct object_alias_item *item;
u64 ino, bix;
level_t level;
int i, err;
super->s_flags |= LOGFS_SB_FLAG_OBJ_ALIAS;
count /= sizeof(*oa);
for (i = 0; i < count; i++) {
item = mempool_alloc(super->s_alias_pool, GFP_NOFS);
if (!item)
return -ENOMEM;
memset(item, 0, sizeof(*item));
super->s_no_object_aliases++;
item->val = oa[i].val;
item->child_no = be16_to_cpu(oa[i].child_no);
ino = be64_to_cpu(oa[i].ino);
bix = be64_to_cpu(oa[i].bix);
level = LEVEL(oa[i].level);
log_aliases("logfs_load_object_aliases(%llx, %llx, %x, %x) %llx\n",
ino, bix, level, item->child_no,
be64_to_cpu(item->val));
block = alias_tree_lookup(sb, ino, bix, level);
if (!block) {
block = __alloc_block(sb, ino, bix, level);
block->ops = &btree_block_ops;
err = alias_tree_insert(sb, ino, bix, level, block);
BUG_ON(err); /* mempool empty */
}
if (test_and_set_bit(item->child_no, block->alias_map)) {
printk(KERN_ERR"LogFS: Alias collision detected\n");
return -EIO;
}
list_move_tail(&block->alias_list, &super->s_object_alias);
list_add(&item->list, &block->item_list);
}
return 0;
}
static void kill_alias(void *_block, unsigned long ignore0,
u64 ignore1, u64 ignore2, size_t ignore3)
{
struct logfs_block *block = _block;
struct super_block *sb = block->sb;
struct logfs_super *super = logfs_super(sb);
struct object_alias_item *item;
while (!list_empty(&block->item_list)) {
item = list_entry(block->item_list.next, typeof(*item), list);
list_del(&item->list);
mempool_free(item, super->s_alias_pool);
}
block->ops->free_block(sb, block);
}
static int obj_type(struct inode *inode, level_t level)
{
if (level == 0) {
if (S_ISDIR(inode->i_mode))
return OBJ_DENTRY;
if (inode->i_ino == LOGFS_INO_MASTER)
return OBJ_INODE;
}
return OBJ_BLOCK;
}
static int obj_len(struct super_block *sb, int obj_type)
{
switch (obj_type) {
case OBJ_DENTRY:
return sizeof(struct logfs_disk_dentry);
case OBJ_INODE:
return sizeof(struct logfs_disk_inode);
case OBJ_BLOCK:
return sb->s_blocksize;
default:
BUG();
}
}
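/*
 * With the current on-medium layout this yields 266 bytes for a dentry,
 * 200 bytes for an inode and sb->s_blocksize (currently 4KiB) for a data
 * or indirect block.
 */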
static int __logfs_segment_write(struct inode *inode, void *buf,
struct logfs_shadow *shadow, int type, int len, int compr)
{
struct logfs_area *area;
struct super_block *sb = inode->i_sb;
s64 ofs;
struct logfs_object_header h;
int acc_len;
if (shadow->gc_level == 0)
acc_len = len;
else
acc_len = obj_len(sb, type);
area = get_area(sb, shadow->gc_level);
ofs = logfs_get_free_bytes(area, len + LOGFS_OBJECT_HEADERSIZE);
LOGFS_BUG_ON(ofs <= 0, sb);
/*
* Order is important. logfs_get_free_bytes(), by modifying the
* segment file, may modify the content of the very page we're about
* to write now. Which is fine, as long as the calculated crc and
* written data still match. So do the modifications _before_
* calculating the crc.
*/
h.len = cpu_to_be16(len);
h.type = type;
h.compr = compr;
h.ino = cpu_to_be64(inode->i_ino);
h.bix = cpu_to_be64(shadow->bix);
h.crc = logfs_crc32(&h, sizeof(h) - 4, 4);
h.data_crc = logfs_crc32(buf, len, 0);
logfs_buf_write(area, ofs, &h, sizeof(h));
logfs_buf_write(area, ofs + LOGFS_OBJECT_HEADERSIZE, buf, len);
shadow->new_ofs = ofs;
shadow->new_len = acc_len + LOGFS_OBJECT_HEADERSIZE;
return 0;
}
static s64 logfs_segment_write_compress(struct inode *inode, void *buf,
struct logfs_shadow *shadow, int type, int len)
{
struct super_block *sb = inode->i_sb;
void *compressor_buf = logfs_super(sb)->s_compressed_je;
ssize_t compr_len;
int ret;
mutex_lock(&logfs_super(sb)->s_journal_mutex);
compr_len = logfs_compress(buf, compressor_buf, len, len);
if (compr_len >= 0) {
ret = __logfs_segment_write(inode, compressor_buf, shadow,
type, compr_len, COMPR_ZLIB);
} else {
ret = __logfs_segment_write(inode, buf, shadow, type, len,
COMPR_NONE);
}
mutex_unlock(&logfs_super(sb)->s_journal_mutex);
return ret;
}
/**
* logfs_segment_write - write data block to object store
 * @inode: inode containing data
 * @page: page holding the block to be written
 * @shadow: shadow entry tracking the old and new location of the block
*
* Returns an errno or zero.
*/
int logfs_segment_write(struct inode *inode, struct page *page,
struct logfs_shadow *shadow)
{
struct super_block *sb = inode->i_sb;
struct logfs_super *super = logfs_super(sb);
int do_compress, type, len;
int ret;
void *buf;
super->s_flags |= LOGFS_SB_FLAG_DIRTY;
BUG_ON(super->s_flags & LOGFS_SB_FLAG_SHUTDOWN);
do_compress = logfs_inode(inode)->li_flags & LOGFS_IF_COMPRESSED;
if (shadow->gc_level != 0) {
/* temporarily disable compression for indirect blocks */
do_compress = 0;
}
type = obj_type(inode, shrink_level(shadow->gc_level));
len = obj_len(sb, type);
buf = kmap(page);
if (do_compress)
ret = logfs_segment_write_compress(inode, buf, shadow, type,
len);
else
ret = __logfs_segment_write(inode, buf, shadow, type, len,
COMPR_NONE);
kunmap(page);
log_segment("logfs_segment_write(%llx, %llx, %x) %llx->%llx %x->%x\n",
shadow->ino, shadow->bix, shadow->gc_level,
shadow->old_ofs, shadow->new_ofs,
shadow->old_len, shadow->new_len);
/* this BUG_ON did catch a locking bug. useful */
BUG_ON(!(shadow->new_ofs & (super->s_segsize - 1)));
return ret;
}
int wbuf_read(struct super_block *sb, u64 ofs, size_t len, void *buf)
{
pgoff_t index = ofs >> PAGE_SHIFT;
struct page *page;
long offset = ofs & (PAGE_SIZE-1);
long copylen;
while (len) {
copylen = min((ulong)len, PAGE_SIZE - offset);
page = get_mapping_page(sb, index, 1);
if (IS_ERR(page))
return PTR_ERR(page);
memcpy(buf, page_address(page) + offset, copylen);
page_cache_release(page);
buf += copylen;
len -= copylen;
offset = 0;
index++;
}
return 0;
}
/*
* The "position" of indirect blocks is ambiguous. It can be the position
* of any data block somewhere behind this indirect block. So we need to
* normalize the positions through logfs_block_mask() before comparing.
*/
static int check_pos(struct super_block *sb, u64 pos1, u64 pos2, level_t level)
{
return (pos1 & logfs_block_mask(sb, level)) !=
(pos2 & logfs_block_mask(sb, level));
}
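/*
 * Example, assuming 4KiB blocks (s_blocksize_bits == 12): at level 1 the
 * mask clears the low 9 bits, so positions 14, 15 and 16 all normalize to 0
 * and compare as matching - they live under the same 1x indirect block.
 * Position 512 normalizes to 512 and therefore does not match them.
 */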
#if 0
static int read_seg_header(struct super_block *sb, u64 ofs,
struct logfs_segment_header *sh)
{
__be32 crc;
int err;
err = wbuf_read(sb, ofs, sizeof(*sh), sh);
if (err)
return err;
crc = logfs_crc32(sh, sizeof(*sh), 4);
if (crc != sh->crc) {
printk(KERN_ERR"LOGFS: header crc error at %llx: expected %x, "
"got %x\n", ofs, be32_to_cpu(sh->crc),
be32_to_cpu(crc));
return -EIO;
}
return 0;
}
#endif
static int read_obj_header(struct super_block *sb, u64 ofs,
struct logfs_object_header *oh)
{
__be32 crc;
int err;
err = wbuf_read(sb, ofs, sizeof(*oh), oh);
if (err)
return err;
crc = logfs_crc32(oh, sizeof(*oh) - 4, 4);
if (crc != oh->crc) {
printk(KERN_ERR"LOGFS: header crc error at %llx: expected %x, "
"got %x\n", ofs, be32_to_cpu(oh->crc),
be32_to_cpu(crc));
return -EIO;
}
return 0;
}
static void move_btree_to_page(struct inode *inode, struct page *page,
__be64 *data)
{
struct super_block *sb = inode->i_sb;
struct logfs_super *super = logfs_super(sb);
struct btree_head128 *head = &super->s_object_alias_tree;
struct logfs_block *block;
struct object_alias_item *item, *next;
if (!(super->s_flags & LOGFS_SB_FLAG_OBJ_ALIAS))
return;
block = btree_remove128(head, inode->i_ino, page->index);
if (!block)
return;
log_blockmove("move_btree_to_page(%llx, %llx, %x)\n",
block->ino, block->bix, block->level);
list_for_each_entry_safe(item, next, &block->item_list, list) {
data[item->child_no] = item->val;
list_del(&item->list);
mempool_free(item, super->s_alias_pool);
}
block->page = page;
SetPagePrivate(page);
page->private = (unsigned long)block;
block->ops = &indirect_block_ops;
initialize_block_counters(page, block, data, 0);
}
/*
* This silences a false, yet annoying gcc warning. I hate it when my editor
* jumps into bitops.h each time I recompile this file.
* TODO: Complain to gcc folks about this and upgrade compiler.
*/
static unsigned long fnb(const unsigned long *addr,
unsigned long size, unsigned long offset)
{
return find_next_bit(addr, size, offset);
}
void move_page_to_btree(struct page *page)
{
struct logfs_block *block = logfs_block(page);
struct super_block *sb = block->sb;
struct logfs_super *super = logfs_super(sb);
struct object_alias_item *item;
unsigned long pos;
__be64 *child;
int err;
if (super->s_flags & LOGFS_SB_FLAG_SHUTDOWN) {
block->ops->free_block(sb, block);
return;
}
log_blockmove("move_page_to_btree(%llx, %llx, %x)\n",
block->ino, block->bix, block->level);
super->s_flags |= LOGFS_SB_FLAG_OBJ_ALIAS;
for (pos = 0; ; pos++) {
pos = fnb(block->alias_map, LOGFS_BLOCK_FACTOR, pos);
if (pos >= LOGFS_BLOCK_FACTOR)
break;
item = mempool_alloc(super->s_alias_pool, GFP_NOFS);
BUG_ON(!item); /* mempool empty */
memset(item, 0, sizeof(*item));
child = kmap_atomic(page, KM_USER0);
item->val = child[pos];
kunmap_atomic(child, KM_USER0);
item->child_no = pos;
list_add(&item->list, &block->item_list);
}
block->page = NULL;
ClearPagePrivate(page);
page->private = 0;
block->ops = &btree_block_ops;
err = alias_tree_insert(block->sb, block->ino, block->bix, block->level,
block);
BUG_ON(err); /* mempool empty */
ClearPageUptodate(page);
}
static int __logfs_segment_read(struct inode *inode, void *buf,
u64 ofs, u64 bix, level_t level)
{
struct super_block *sb = inode->i_sb;
void *compressor_buf = logfs_super(sb)->s_compressed_je;
struct logfs_object_header oh;
__be32 crc;
u16 len;
int err, block_len;
block_len = obj_len(sb, obj_type(inode, level));
err = read_obj_header(sb, ofs, &oh);
if (err)
goto out_err;
err = -EIO;
if (be64_to_cpu(oh.ino) != inode->i_ino
|| check_pos(sb, be64_to_cpu(oh.bix), bix, level)) {
printk(KERN_ERR"LOGFS: (ino, bix) don't match at %llx: "
"expected (%lx, %llx), got (%llx, %llx)\n",
ofs, inode->i_ino, bix,
be64_to_cpu(oh.ino), be64_to_cpu(oh.bix));
goto out_err;
}
len = be16_to_cpu(oh.len);
switch (oh.compr) {
case COMPR_NONE:
err = wbuf_read(sb, ofs + LOGFS_OBJECT_HEADERSIZE, len, buf);
if (err)
goto out_err;
crc = logfs_crc32(buf, len, 0);
if (crc != oh.data_crc) {
printk(KERN_ERR"LOGFS: uncompressed data crc error at "
"%llx: expected %x, got %x\n", ofs,
be32_to_cpu(oh.data_crc),
be32_to_cpu(crc));
goto out_err;
}
break;
case COMPR_ZLIB:
mutex_lock(&logfs_super(sb)->s_journal_mutex);
err = wbuf_read(sb, ofs + LOGFS_OBJECT_HEADERSIZE, len,
compressor_buf);
if (err) {
mutex_unlock(&logfs_super(sb)->s_journal_mutex);
goto out_err;
}
crc = logfs_crc32(compressor_buf, len, 0);
if (crc != oh.data_crc) {
printk(KERN_ERR"LOGFS: compressed data crc error at "
"%llx: expected %x, got %x\n", ofs,
be32_to_cpu(oh.data_crc),
be32_to_cpu(crc));
mutex_unlock(&logfs_super(sb)->s_journal_mutex);
goto out_err;
}
err = logfs_uncompress(compressor_buf, buf, len, block_len);
mutex_unlock(&logfs_super(sb)->s_journal_mutex);
if (err) {
printk(KERN_ERR"LOGFS: uncompress error at %llx\n", ofs);
goto out_err;
}
break;
default:
LOGFS_BUG(sb);
err = -EIO;
goto out_err;
}
return 0;
out_err:
logfs_set_ro(sb);
printk(KERN_ERR"LOGFS: device is read-only now\n");
LOGFS_BUG(sb);
return err;
}
/**
* logfs_segment_read - read data block from object store
* @inode: inode containing data
 * @page: page to read the block into
* @ofs: physical data offset
* @bix: block index
* @level: block level
*
* Returns 0 on success or a negative errno.
*/
int logfs_segment_read(struct inode *inode, struct page *page,
u64 ofs, u64 bix, level_t level)
{
int err;
void *buf;
if (PageUptodate(page))
return 0;
ofs &= ~LOGFS_FULLY_POPULATED;
buf = kmap(page);
err = __logfs_segment_read(inode, buf, ofs, bix, level);
if (!err) {
move_btree_to_page(inode, page, buf);
SetPageUptodate(page);
}
kunmap(page);
log_segment("logfs_segment_read(%lx, %llx, %x) %llx (%d)\n",
inode->i_ino, bix, level, ofs, err);
return err;
}
int logfs_segment_delete(struct inode *inode, struct logfs_shadow *shadow)
{
struct super_block *sb = inode->i_sb;
struct logfs_super *super = logfs_super(sb);
struct logfs_object_header h;
u16 len;
int err;
super->s_flags |= LOGFS_SB_FLAG_DIRTY;
BUG_ON(super->s_flags & LOGFS_SB_FLAG_SHUTDOWN);
BUG_ON(shadow->old_ofs & LOGFS_FULLY_POPULATED);
if (!shadow->old_ofs)
return 0;
log_segment("logfs_segment_delete(%llx, %llx, %x) %llx->%llx %x->%x\n",
shadow->ino, shadow->bix, shadow->gc_level,
shadow->old_ofs, shadow->new_ofs,
shadow->old_len, shadow->new_len);
err = read_obj_header(sb, shadow->old_ofs, &h);
LOGFS_BUG_ON(err, sb);
LOGFS_BUG_ON(be64_to_cpu(h.ino) != inode->i_ino, sb);
LOGFS_BUG_ON(check_pos(sb, shadow->bix, be64_to_cpu(h.bix),
shrink_level(shadow->gc_level)), sb);
if (shadow->gc_level == 0)
len = be16_to_cpu(h.len);
else
len = obj_len(sb, h.type);
shadow->old_len = len + sizeof(h);
return 0;
}
static void freeseg(struct super_block *sb, u32 segno)
{
struct logfs_super *super = logfs_super(sb);
struct address_space *mapping = super->s_mapping_inode->i_mapping;
struct page *page;
u64 ofs, start, end;
start = dev_ofs(sb, segno, 0);
end = dev_ofs(sb, segno + 1, 0);
for (ofs = start; ofs < end; ofs += PAGE_SIZE) {
page = find_get_page(mapping, ofs >> PAGE_SHIFT);
if (!page)
continue;
ClearPagePrivate(page);
page_cache_release(page);
}
}
int logfs_open_area(struct logfs_area *area, size_t bytes)
{
struct super_block *sb = area->a_sb;
struct logfs_super *super = logfs_super(sb);
int err, closed = 0;
if (area->a_is_open && area->a_used_bytes + bytes <= super->s_segsize)
return 0;
if (area->a_is_open) {
u64 ofs = dev_ofs(sb, area->a_segno, area->a_written_bytes);
u32 len = super->s_segsize - area->a_written_bytes;
log_gc("logfs_close_area(%x)\n", area->a_segno);
pad_wbuf(area, 1);
super->s_devops->writeseg(area->a_sb, ofs, len);
freeseg(sb, area->a_segno);
closed = 1;
}
area->a_used_bytes = 0;
area->a_written_bytes = 0;
again:
area->a_ops->get_free_segment(area);
area->a_ops->get_erase_count(area);
log_gc("logfs_open_area(%x, %x)\n", area->a_segno, area->a_level);
err = area->a_ops->erase_segment(area);
if (err) {
printk(KERN_WARNING "LogFS: Error erasing segment %x\n",
area->a_segno);
logfs_mark_segment_bad(sb, area->a_segno);
goto again;
}
area->a_is_open = 1;
return closed;
}
void logfs_sync_area(struct logfs_area *area)
{
struct super_block *sb = area->a_sb;
struct logfs_super *super = logfs_super(sb);
u64 ofs = dev_ofs(sb, area->a_segno, area->a_written_bytes);
u32 len = (area->a_used_bytes - area->a_written_bytes);
if (super->s_writesize)
len &= ~(super->s_writesize - 1);
if (len == 0)
return;
pad_wbuf(area, 0);
super->s_devops->writeseg(sb, ofs, len);
area->a_written_bytes += len;
}
void logfs_sync_segments(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
int i;
for_each_area(i)
logfs_sync_area(super->s_area[i]);
}
/*
* Pick a free segment to be used for this area. Effectively takes a
* candidate from the free list (not really a candidate anymore).
*/
static void ostore_get_free_segment(struct logfs_area *area)
{
struct super_block *sb = area->a_sb;
struct logfs_super *super = logfs_super(sb);
if (super->s_free_list.count == 0) {
printk(KERN_ERR"LOGFS: ran out of free segments\n");
LOGFS_BUG(sb);
}
area->a_segno = get_best_cand(sb, &super->s_free_list, NULL);
}
static void ostore_get_erase_count(struct logfs_area *area)
{
struct logfs_segment_entry se;
u32 ec_level;
logfs_get_segment_entry(area->a_sb, area->a_segno, &se);
BUG_ON(se.ec_level == cpu_to_be32(BADSEG) ||
se.valid == cpu_to_be32(RESERVED));
ec_level = be32_to_cpu(se.ec_level);
area->a_erase_count = (ec_level >> 4) + 1;
}
static int ostore_erase_segment(struct logfs_area *area)
{
struct super_block *sb = area->a_sb;
struct logfs_segment_header sh;
u64 ofs;
int err;
err = logfs_erase_segment(sb, area->a_segno, 0);
if (err)
return err;
sh.pad = 0;
sh.type = SEG_OSTORE;
sh.level = (__force u8)area->a_level;
sh.segno = cpu_to_be32(area->a_segno);
sh.ec = cpu_to_be32(area->a_erase_count);
sh.gec = cpu_to_be64(logfs_super(sb)->s_gec);
sh.crc = logfs_crc32(&sh, sizeof(sh), 4);
logfs_set_segment_erased(sb, area->a_segno, area->a_erase_count,
area->a_level);
ofs = dev_ofs(sb, area->a_segno, 0);
area->a_used_bytes = sizeof(sh);
logfs_buf_write(area, ofs, &sh, sizeof(sh));
return 0;
}
static const struct logfs_area_ops ostore_area_ops = {
.get_free_segment = ostore_get_free_segment,
.get_erase_count = ostore_get_erase_count,
.erase_segment = ostore_erase_segment,
};
static void free_area(struct logfs_area *area)
{
if (area)
freeseg(area->a_sb, area->a_segno);
kfree(area);
}
static struct logfs_area *alloc_area(struct super_block *sb)
{
struct logfs_area *area;
area = kzalloc(sizeof(*area), GFP_KERNEL);
if (!area)
return NULL;
area->a_sb = sb;
return area;
}
static void map_invalidatepage(struct page *page, unsigned long l)
{
BUG();
}
static int map_releasepage(struct page *page, gfp_t g)
{
/* Don't release these pages */
return 0;
}
static const struct address_space_operations mapping_aops = {
.invalidatepage = map_invalidatepage,
.releasepage = map_releasepage,
.set_page_dirty = __set_page_dirty_nobuffers,
};
int logfs_init_mapping(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
struct address_space *mapping;
struct inode *inode;
inode = logfs_new_meta_inode(sb, LOGFS_INO_MAPPING);
if (IS_ERR(inode))
return PTR_ERR(inode);
super->s_mapping_inode = inode;
mapping = inode->i_mapping;
mapping->a_ops = &mapping_aops;
/* Would it be possible to use __GFP_HIGHMEM as well? */
mapping_set_gfp_mask(mapping, GFP_NOFS);
return 0;
}
int logfs_init_areas(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
int i = -1;
super->s_alias_pool = mempool_create_kmalloc_pool(600,
sizeof(struct object_alias_item));
if (!super->s_alias_pool)
return -ENOMEM;
super->s_journal_area = alloc_area(sb);
if (!super->s_journal_area)
goto err;
for_each_area(i) {
super->s_area[i] = alloc_area(sb);
if (!super->s_area[i])
goto err;
super->s_area[i]->a_level = GC_LEVEL(i);
super->s_area[i]->a_ops = &ostore_area_ops;
}
btree_init_mempool128(&super->s_object_alias_tree,
super->s_btree_pool);
return 0;
err:
for (i--; i >= 0; i--)
free_area(super->s_area[i]);
free_area(super->s_journal_area);
mempool_destroy(super->s_alias_pool);
return -ENOMEM;
}
void logfs_cleanup_areas(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
int i;
btree_grim_visitor128(&super->s_object_alias_tree, 0, kill_alias);
for_each_area(i)
free_area(super->s_area[i]);
free_area(super->s_journal_area);
destroy_meta_inode(super->s_mapping_inode);
}

650
fs/logfs/super.c Normal file
View File

@ -0,0 +1,650 @@
/*
* fs/logfs/super.c
*
* As should be obvious for Linux kernel code, license is GPLv2
*
* Copyright (c) 2005-2008 Joern Engel <joern@logfs.org>
*
* Generally contains mount/umount code and also serves as a dump area for
* any functions that don't fit elsewhere and don't justify a file of their
* own.
*/
#include "logfs.h"
#include <linux/bio.h>
#include <linux/mtd/mtd.h>
#include <linux/statfs.h>
#include <linux/buffer_head.h>
static DEFINE_MUTEX(emergency_mutex);
static struct page *emergency_page;
struct page *emergency_read_begin(struct address_space *mapping, pgoff_t index)
{
filler_t *filler = (filler_t *)mapping->a_ops->readpage;
struct page *page;
int err;
page = read_cache_page(mapping, index, filler, NULL);
if (!IS_ERR(page))
return page;
/* No more pages available, switch to emergency page */
printk(KERN_INFO"Logfs: Using emergency page\n");
mutex_lock(&emergency_mutex);
err = filler(NULL, emergency_page);
if (err) {
mutex_unlock(&emergency_mutex);
printk(KERN_EMERG"Logfs: Error reading emergency page\n");
return ERR_PTR(err);
}
return emergency_page;
}
void emergency_read_end(struct page *page)
{
if (page == emergency_page)
mutex_unlock(&emergency_mutex);
else
page_cache_release(page);
}
static void dump_segfile(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
struct logfs_segment_entry se;
u32 segno;
for (segno = 0; segno < super->s_no_segs; segno++) {
logfs_get_segment_entry(sb, segno, &se);
printk("%3x: %6x %8x", segno, be32_to_cpu(se.ec_level),
be32_to_cpu(se.valid));
if (++segno < super->s_no_segs) {
logfs_get_segment_entry(sb, segno, &se);
printk(" %6x %8x", be32_to_cpu(se.ec_level),
be32_to_cpu(se.valid));
}
if (++segno < super->s_no_segs) {
logfs_get_segment_entry(sb, segno, &se);
printk(" %6x %8x", be32_to_cpu(se.ec_level),
be32_to_cpu(se.valid));
}
if (++segno < super->s_no_segs) {
logfs_get_segment_entry(sb, segno, &se);
printk(" %6x %8x", be32_to_cpu(se.ec_level),
be32_to_cpu(se.valid));
}
printk("\n");
}
}
/*
* logfs_crash_dump - dump debug information to device
*
* The LogFS superblock only occupies part of a segment. This function will
* write as much debug information as it can gather into the spare space.
*/
void logfs_crash_dump(struct super_block *sb)
{
dump_segfile(sb);
}
/*
* TODO: move to lib/string.c
*/
/**
* memchr_inv - Find an unmatching character in an area of memory.
* @s: The memory area
* @c: The byte to search for
* @n: The size of the area.
*
* returns the address of the first character other than @c, or %NULL
* if the whole buffer contains just @c.
*/
void *memchr_inv(const void *s, int c, size_t n)
{
const unsigned char *p = s;
while (n-- != 0)
if ((unsigned char)c != *p++)
return (void *)(p - 1);
return NULL;
}
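/*
 * Usage sketch (illustrative only, not part of the merged code): erased
 * flash reads back as all 0xff, so memchr_inv() gives a one-line check
 * whether a region is still erased.  The helper name below is invented.
 */
static inline int region_is_erased(const void *buf, size_t len)
{
	return memchr_inv(buf, 0xff, len) == NULL;
}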
/*
* FIXME: There should be a reserve for root, similar to ext2.
*/
int logfs_statfs(struct dentry *dentry, struct kstatfs *stats)
{
struct super_block *sb = dentry->d_sb;
struct logfs_super *super = logfs_super(sb);
stats->f_type = LOGFS_MAGIC_U32;
stats->f_bsize = sb->s_blocksize;
stats->f_blocks = super->s_size >> LOGFS_BLOCK_BITS >> 3;
stats->f_bfree = super->s_free_bytes >> sb->s_blocksize_bits;
stats->f_bavail = super->s_free_bytes >> sb->s_blocksize_bits;
stats->f_files = 0;
stats->f_ffree = 0;
stats->f_namelen = LOGFS_MAX_NAMELEN;
return 0;
}
static int logfs_sb_set(struct super_block *sb, void *_super)
{
struct logfs_super *super = _super;
sb->s_fs_info = super;
sb->s_mtd = super->s_mtd;
sb->s_bdev = super->s_bdev;
return 0;
}
static int logfs_sb_test(struct super_block *sb, void *_super)
{
struct logfs_super *super = _super;
struct mtd_info *mtd = super->s_mtd;
if (mtd && sb->s_mtd == mtd)
return 1;
if (super->s_bdev && sb->s_bdev == super->s_bdev)
return 1;
return 0;
}
static void set_segment_header(struct logfs_segment_header *sh, u8 type,
u8 level, u32 segno, u32 ec)
{
sh->pad = 0;
sh->type = type;
sh->level = level;
sh->segno = cpu_to_be32(segno);
sh->ec = cpu_to_be32(ec);
sh->gec = cpu_to_be64(segno);
sh->crc = logfs_crc32(sh, LOGFS_SEGMENT_HEADERSIZE, 4);
}
static void logfs_write_ds(struct super_block *sb, struct logfs_disk_super *ds,
u32 segno, u32 ec)
{
struct logfs_super *super = logfs_super(sb);
struct logfs_segment_header *sh = &ds->ds_sh;
int i;
memset(ds, 0, sizeof(*ds));
set_segment_header(sh, SEG_SUPER, 0, segno, ec);
ds->ds_ifile_levels = super->s_ifile_levels;
ds->ds_iblock_levels = super->s_iblock_levels;
ds->ds_data_levels = super->s_data_levels; /* XXX: Remove */
ds->ds_segment_shift = super->s_segshift;
ds->ds_block_shift = sb->s_blocksize_bits;
ds->ds_write_shift = super->s_writeshift;
ds->ds_filesystem_size = cpu_to_be64(super->s_size);
ds->ds_segment_size = cpu_to_be32(super->s_segsize);
ds->ds_bad_seg_reserve = cpu_to_be32(super->s_bad_seg_reserve);
ds->ds_feature_incompat = cpu_to_be64(super->s_feature_incompat);
ds->ds_feature_ro_compat= cpu_to_be64(super->s_feature_ro_compat);
ds->ds_feature_compat = cpu_to_be64(super->s_feature_compat);
ds->ds_feature_flags = cpu_to_be64(super->s_feature_flags);
ds->ds_root_reserve = cpu_to_be64(super->s_root_reserve);
ds->ds_speed_reserve = cpu_to_be64(super->s_speed_reserve);
journal_for_each(i)
ds->ds_journal_seg[i] = cpu_to_be32(super->s_journal_seg[i]);
ds->ds_magic = cpu_to_be64(LOGFS_MAGIC);
ds->ds_crc = logfs_crc32(ds, sizeof(*ds),
LOGFS_SEGMENT_HEADERSIZE + 12);
}
static int write_one_sb(struct super_block *sb,
struct page *(*find_sb)(struct super_block *sb, u64 *ofs))
{
struct logfs_super *super = logfs_super(sb);
struct logfs_disk_super *ds;
struct logfs_segment_entry se;
struct page *page;
u64 ofs;
u32 ec, segno;
int err;
page = find_sb(sb, &ofs);
if (!page)
return -EIO;
ds = page_address(page);
segno = seg_no(sb, ofs);
logfs_get_segment_entry(sb, segno, &se);
ec = be32_to_cpu(se.ec_level) >> 4;
ec++;
logfs_set_segment_erased(sb, segno, ec, 0);
logfs_write_ds(sb, ds, segno, ec);
err = super->s_devops->write_sb(sb, page);
page_cache_release(page);
return err;
}
int logfs_write_sb(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
int err;
/* First superblock */
err = write_one_sb(sb, super->s_devops->find_first_sb);
if (err)
return err;
/* Last superblock */
err = write_one_sb(sb, super->s_devops->find_last_sb);
if (err)
return err;
return 0;
}
static int ds_cmp(const void *ds0, const void *ds1)
{
size_t len = sizeof(struct logfs_disk_super);
/* We know the segment headers differ, so ignore them */
len -= LOGFS_SEGMENT_HEADERSIZE;
ds0 += LOGFS_SEGMENT_HEADERSIZE;
ds1 += LOGFS_SEGMENT_HEADERSIZE;
return memcmp(ds0, ds1, len);
}
static int logfs_recover_sb(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
struct logfs_disk_super _ds0, *ds0 = &_ds0;
struct logfs_disk_super _ds1, *ds1 = &_ds1;
int err, valid0, valid1;
/* read first superblock */
err = wbuf_read(sb, super->s_sb_ofs[0], sizeof(*ds0), ds0);
if (err)
return err;
/* read last superblock */
err = wbuf_read(sb, super->s_sb_ofs[1], sizeof(*ds1), ds1);
if (err)
return err;
valid0 = logfs_check_ds(ds0) == 0;
valid1 = logfs_check_ds(ds1) == 0;
if (!valid0 && valid1) {
printk(KERN_INFO"First superblock is invalid - fixing.\n");
return write_one_sb(sb, super->s_devops->find_first_sb);
}
if (valid0 && !valid1) {
printk(KERN_INFO"Last superblock is invalid - fixing.\n");
return write_one_sb(sb, super->s_devops->find_last_sb);
}
if (valid0 && valid1 && ds_cmp(ds0, ds1)) {
printk(KERN_INFO"Superblocks don't match - fixing.\n");
return write_one_sb(sb, super->s_devops->find_last_sb);
}
/* If neither is valid now, something's wrong. Didn't we properly
* check them before?!? */
BUG_ON(!valid0 && !valid1);
return 0;
}
static int logfs_make_writeable(struct super_block *sb)
{
int err;
/* Repair any broken superblock copies */
err = logfs_recover_sb(sb);
if (err)
return err;
/* Check areas for trailing unaccounted data */
err = logfs_check_areas(sb);
if (err)
return err;
err = logfs_open_segfile(sb);
if (err)
return err;
/* Do one GC pass before any data gets dirtied */
logfs_gc_pass(sb);
/* after all initializations are done, replay the journal
* for rw-mounts, if necessary */
err = logfs_replay_journal(sb);
if (err)
return err;
return 0;
}
static int logfs_get_sb_final(struct super_block *sb, struct vfsmount *mnt)
{
struct logfs_super *super = logfs_super(sb);
struct inode *rootdir;
int err;
/* root dir */
rootdir = logfs_iget(sb, LOGFS_INO_ROOT);
if (IS_ERR(rootdir))
goto fail;
sb->s_root = d_alloc_root(rootdir);
if (!sb->s_root)
goto fail;
super->s_erase_page = alloc_pages(GFP_KERNEL, 0);
if (!super->s_erase_page)
goto fail2;
memset(page_address(super->s_erase_page), 0xFF, PAGE_SIZE);
/* FIXME: check for read-only mounts */
err = logfs_make_writeable(sb);
if (err)
goto fail3;
log_super("LogFS: Finished mounting\n");
simple_set_mnt(mnt, sb);
return 0;
fail3:
__free_page(super->s_erase_page);
fail2:
iput(rootdir);
fail:
iput(logfs_super(sb)->s_master_inode);
return -EIO;
}
int logfs_check_ds(struct logfs_disk_super *ds)
{
struct logfs_segment_header *sh = &ds->ds_sh;
if (ds->ds_magic != cpu_to_be64(LOGFS_MAGIC))
return -EINVAL;
if (sh->crc != logfs_crc32(sh, LOGFS_SEGMENT_HEADERSIZE, 4))
return -EINVAL;
if (ds->ds_crc != logfs_crc32(ds, sizeof(*ds),
LOGFS_SEGMENT_HEADERSIZE + 12))
return -EINVAL;
return 0;
}
static struct page *find_super_block(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
struct page *first, *last;
first = super->s_devops->find_first_sb(sb, &super->s_sb_ofs[0]);
if (!first || IS_ERR(first))
return NULL;
last = super->s_devops->find_last_sb(sb, &super->s_sb_ofs[1]);
if (!last || IS_ERR(last)) {
page_cache_release(first);
return NULL;
}
if (!logfs_check_ds(page_address(first))) {
page_cache_release(last);
return first;
}
/* First one didn't work, try the second superblock */
if (!logfs_check_ds(page_address(last))) {
page_cache_release(first);
return last;
}
/* Neither worked, sorry folks */
page_cache_release(first);
page_cache_release(last);
return NULL;
}
static int __logfs_read_sb(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
struct page *page;
struct logfs_disk_super *ds;
int i;
page = find_super_block(sb);
if (!page)
return -EIO;
ds = page_address(page);
super->s_size = be64_to_cpu(ds->ds_filesystem_size);
super->s_root_reserve = be64_to_cpu(ds->ds_root_reserve);
super->s_speed_reserve = be64_to_cpu(ds->ds_speed_reserve);
super->s_bad_seg_reserve = be32_to_cpu(ds->ds_bad_seg_reserve);
super->s_segsize = 1 << ds->ds_segment_shift;
super->s_segmask = (1 << ds->ds_segment_shift) - 1;
super->s_segshift = ds->ds_segment_shift;
sb->s_blocksize = 1 << ds->ds_block_shift;
sb->s_blocksize_bits = ds->ds_block_shift;
super->s_writesize = 1 << ds->ds_write_shift;
super->s_writeshift = ds->ds_write_shift;
super->s_no_segs = super->s_size >> super->s_segshift;
super->s_no_blocks = super->s_segsize >> sb->s_blocksize_bits;
super->s_feature_incompat = be64_to_cpu(ds->ds_feature_incompat);
super->s_feature_ro_compat = be64_to_cpu(ds->ds_feature_ro_compat);
super->s_feature_compat = be64_to_cpu(ds->ds_feature_compat);
super->s_feature_flags = be64_to_cpu(ds->ds_feature_flags);
journal_for_each(i)
super->s_journal_seg[i] = be32_to_cpu(ds->ds_journal_seg[i]);
super->s_ifile_levels = ds->ds_ifile_levels;
super->s_iblock_levels = ds->ds_iblock_levels;
super->s_data_levels = ds->ds_data_levels;
super->s_total_levels = super->s_ifile_levels + super->s_iblock_levels
+ super->s_data_levels;
page_cache_release(page);
return 0;
}
static int logfs_read_sb(struct super_block *sb, int read_only)
{
struct logfs_super *super = logfs_super(sb);
int ret;
super->s_btree_pool = mempool_create(32, btree_alloc, btree_free, NULL);
if (!super->s_btree_pool)
return -ENOMEM;
btree_init_mempool64(&super->s_shadow_tree.new, super->s_btree_pool);
btree_init_mempool64(&super->s_shadow_tree.old, super->s_btree_pool);
ret = logfs_init_mapping(sb);
if (ret)
return ret;
ret = __logfs_read_sb(sb);
if (ret)
return ret;
if (super->s_feature_incompat & ~LOGFS_FEATURES_INCOMPAT)
return -EIO;
if ((super->s_feature_ro_compat & ~LOGFS_FEATURES_RO_COMPAT) &&
!read_only)
return -EIO;
mutex_init(&super->s_dirop_mutex);
mutex_init(&super->s_object_alias_mutex);
INIT_LIST_HEAD(&super->s_freeing_list);
ret = logfs_init_rw(sb);
if (ret)
return ret;
ret = logfs_init_areas(sb);
if (ret)
return ret;
ret = logfs_init_gc(sb);
if (ret)
return ret;
ret = logfs_init_journal(sb);
if (ret)
return ret;
return 0;
}
static void logfs_kill_sb(struct super_block *sb)
{
struct logfs_super *super = logfs_super(sb);
log_super("LogFS: Start unmounting\n");
/* Alias entries slow down mount, so evict as many as possible */
sync_filesystem(sb);
logfs_write_anchor(sb);
/*
* From this point on alias entries are simply dropped - and any
* writes to the object store are considered bugs.
*/
super->s_flags |= LOGFS_SB_FLAG_SHUTDOWN;
log_super("LogFS: Now in shutdown\n");
generic_shutdown_super(sb);
BUG_ON(super->s_dirty_used_bytes || super->s_dirty_free_bytes);
logfs_cleanup_gc(sb);
logfs_cleanup_journal(sb);
logfs_cleanup_areas(sb);
logfs_cleanup_rw(sb);
if (super->s_erase_page)
__free_page(super->s_erase_page);
super->s_devops->put_device(sb);
mempool_destroy(super->s_btree_pool);
mempool_destroy(super->s_alias_pool);
kfree(super);
log_super("LogFS: Finished unmounting\n");
}
int logfs_get_sb_device(struct file_system_type *type, int flags,
struct mtd_info *mtd, struct block_device *bdev,
const struct logfs_device_ops *devops, struct vfsmount *mnt)
{
struct logfs_super *super;
struct super_block *sb;
int err = -ENOMEM;
static int mount_count;
log_super("LogFS: Start mount %x\n", mount_count++);
super = kzalloc(sizeof(*super), GFP_KERNEL);
if (!super)
goto err0;
super->s_mtd = mtd;
super->s_bdev = bdev;
err = -EINVAL;
sb = sget(type, logfs_sb_test, logfs_sb_set, super);
if (IS_ERR(sb))
goto err0;
if (sb->s_root) {
/* Device is already in use */
err = 0;
simple_set_mnt(mnt, sb);
goto err0;
}
super->s_devops = devops;
/*
* sb->s_maxbytes is limited to 8TB. On 32bit systems, the page cache
* only covers 16TB and the upper 8TB are used for indirect blocks.
* On 64bit systems we could bump up the limit, but that would make
* the filesystem incompatible with 32bit systems.
*/
sb->s_maxbytes = (1ull << 43) - 1;
sb->s_op = &logfs_super_operations;
sb->s_flags = flags | MS_NOATIME;
err = logfs_read_sb(sb, sb->s_flags & MS_RDONLY);
if (err)
goto err1;
sb->s_flags |= MS_ACTIVE;
err = logfs_get_sb_final(sb, mnt);
if (err)
goto err1;
return 0;
err1:
up_write(&sb->s_umount);
deactivate_super(sb);
return err;
err0:
kfree(super);
//devops->put_device(sb);
return err;
}
static int logfs_get_sb(struct file_system_type *type, int flags,
const char *devname, void *data, struct vfsmount *mnt)
{
ulong mtdnr;
if (!devname)
return logfs_get_sb_bdev(type, flags, devname, mnt);
if (strncmp(devname, "mtd", 3))
return logfs_get_sb_bdev(type, flags, devname, mnt);
{
char *garbage;
mtdnr = simple_strtoul(devname+3, &garbage, 0);
if (*garbage)
return -EINVAL;
}
return logfs_get_sb_mtd(type, flags, mtdnr, mnt);
}
static struct file_system_type logfs_fs_type = {
.owner = THIS_MODULE,
.name = "logfs",
.get_sb = logfs_get_sb,
.kill_sb = logfs_kill_sb,
.fs_flags = FS_REQUIRES_DEV,
};
static int __init logfs_init(void)
{
int ret;
emergency_page = alloc_pages(GFP_KERNEL, 0);
if (!emergency_page)
return -ENOMEM;
ret = logfs_compr_init();
if (ret)
goto out1;
ret = logfs_init_inode_cache();
if (ret)
goto out2;
return register_filesystem(&logfs_fs_type);
out2:
logfs_compr_exit();
out1:
__free_pages(emergency_page, 0);
return ret;
}
static void __exit logfs_exit(void)
{
unregister_filesystem(&logfs_fs_type);
logfs_destroy_inode_cache();
logfs_compr_exit();
__free_pages(emergency_page, 0);
}
module_init(logfs_init);
module_exit(logfs_exit);
MODULE_LICENSE("GPL v2");
MODULE_AUTHOR("Joern Engel <joern@logfs.org>");
MODULE_DESCRIPTION("scalable flash filesystem");

109
include/linux/btree-128.h Normal file
View File

@ -0,0 +1,109 @@
extern struct btree_geo btree_geo128;
struct btree_head128 { struct btree_head h; };
static inline void btree_init_mempool128(struct btree_head128 *head,
mempool_t *mempool)
{
btree_init_mempool(&head->h, mempool);
}
static inline int btree_init128(struct btree_head128 *head)
{
return btree_init(&head->h);
}
static inline void btree_destroy128(struct btree_head128 *head)
{
btree_destroy(&head->h);
}
static inline void *btree_lookup128(struct btree_head128 *head, u64 k1, u64 k2)
{
u64 key[2] = {k1, k2};
return btree_lookup(&head->h, &btree_geo128, (unsigned long *)&key);
}
static inline void *btree_get_prev128(struct btree_head128 *head,
u64 *k1, u64 *k2)
{
u64 key[2] = {*k1, *k2};
void *val;
val = btree_get_prev(&head->h, &btree_geo128,
(unsigned long *)&key);
*k1 = key[0];
*k2 = key[1];
return val;
}
static inline int btree_insert128(struct btree_head128 *head, u64 k1, u64 k2,
void *val, gfp_t gfp)
{
u64 key[2] = {k1, k2};
return btree_insert(&head->h, &btree_geo128,
(unsigned long *)&key, val, gfp);
}
static inline int btree_update128(struct btree_head128 *head, u64 k1, u64 k2,
void *val)
{
u64 key[2] = {k1, k2};
return btree_update(&head->h, &btree_geo128,
(unsigned long *)&key, val);
}
static inline void *btree_remove128(struct btree_head128 *head, u64 k1, u64 k2)
{
u64 key[2] = {k1, k2};
return btree_remove(&head->h, &btree_geo128, (unsigned long *)&key);
}
static inline void *btree_last128(struct btree_head128 *head, u64 *k1, u64 *k2)
{
u64 key[2];
void *val;
val = btree_last(&head->h, &btree_geo128, (unsigned long *)&key[0]);
if (val) {
*k1 = key[0];
*k2 = key[1];
}
return val;
}
static inline int btree_merge128(struct btree_head128 *target,
struct btree_head128 *victim,
gfp_t gfp)
{
return btree_merge(&target->h, &victim->h, &btree_geo128, gfp);
}
void visitor128(void *elem, unsigned long opaque, unsigned long *__key,
size_t index, void *__func);
typedef void (*visitor128_t)(void *elem, unsigned long opaque,
u64 key1, u64 key2, size_t index);
static inline size_t btree_visitor128(struct btree_head128 *head,
unsigned long opaque,
visitor128_t func2)
{
return btree_visitor(&head->h, &btree_geo128, opaque,
visitor128, func2);
}
static inline size_t btree_grim_visitor128(struct btree_head128 *head,
unsigned long opaque,
visitor128_t func2)
{
return btree_grim_visitor(&head->h, &btree_geo128, opaque,
visitor128, func2);
}
#define btree_for_each_safe128(head, k1, k2, val) \
for (val = btree_last128(head, &k1, &k2); \
val; \
val = btree_get_prev128(head, &k1, &k2))
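/*
 * A minimal usage sketch of the 128-bit key API (illustrative only, not part
 * of this header).  It assumes <linux/btree.h>, <linux/gfp.h> and
 * <linux/errno.h> are available; the function and payload names are invented.
 */
static int btree128_example(void)
{
	struct btree_head128 head;
	static char payload[] = "example";
	int err;
	err = btree_init128(&head);
	if (err)
		return err;
	/* Keys are (u64, u64) pairs; values must be non-NULL pointers. */
	err = btree_insert128(&head, 1, 2, payload, GFP_KERNEL);
	if (!err && btree_lookup128(&head, 1, 2) != payload)
		err = -ENOENT;
	btree_destroy128(&head);
	return err;
}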

147
include/linux/btree-type.h Normal file
View File

@ -0,0 +1,147 @@
#define __BTREE_TP(pfx, type, sfx) pfx ## type ## sfx
#define _BTREE_TP(pfx, type, sfx) __BTREE_TP(pfx, type, sfx)
#define BTREE_TP(pfx) _BTREE_TP(pfx, BTREE_TYPE_SUFFIX,)
#define BTREE_FN(name) BTREE_TP(btree_ ## name)
#define BTREE_TYPE_HEAD BTREE_TP(struct btree_head)
#define VISITOR_FN BTREE_TP(visitor)
#define VISITOR_FN_T _BTREE_TP(visitor, BTREE_TYPE_SUFFIX, _t)
BTREE_TYPE_HEAD {
struct btree_head h;
};
static inline void BTREE_FN(init_mempool)(BTREE_TYPE_HEAD *head,
mempool_t *mempool)
{
btree_init_mempool(&head->h, mempool);
}
static inline int BTREE_FN(init)(BTREE_TYPE_HEAD *head)
{
return btree_init(&head->h);
}
static inline void BTREE_FN(destroy)(BTREE_TYPE_HEAD *head)
{
btree_destroy(&head->h);
}
static inline int BTREE_FN(merge)(BTREE_TYPE_HEAD *target,
BTREE_TYPE_HEAD *victim,
gfp_t gfp)
{
return btree_merge(&target->h, &victim->h, BTREE_TYPE_GEO, gfp);
}
#if (BITS_PER_LONG > BTREE_TYPE_BITS)
static inline void *BTREE_FN(lookup)(BTREE_TYPE_HEAD *head, BTREE_KEYTYPE key)
{
unsigned long _key = key;
return btree_lookup(&head->h, BTREE_TYPE_GEO, &_key);
}
static inline int BTREE_FN(insert)(BTREE_TYPE_HEAD *head, BTREE_KEYTYPE key,
void *val, gfp_t gfp)
{
unsigned long _key = key;
return btree_insert(&head->h, BTREE_TYPE_GEO, &_key, val, gfp);
}
static inline int BTREE_FN(update)(BTREE_TYPE_HEAD *head, BTREE_KEYTYPE key,
void *val)
{
unsigned long _key = key;
return btree_update(&head->h, BTREE_TYPE_GEO, &_key, val);
}
static inline void *BTREE_FN(remove)(BTREE_TYPE_HEAD *head, BTREE_KEYTYPE key)
{
unsigned long _key = key;
return btree_remove(&head->h, BTREE_TYPE_GEO, &_key);
}
static inline void *BTREE_FN(last)(BTREE_TYPE_HEAD *head, BTREE_KEYTYPE *key)
{
unsigned long _key;
void *val = btree_last(&head->h, BTREE_TYPE_GEO, &_key);
if (val)
*key = _key;
return val;
}
static inline void *BTREE_FN(get_prev)(BTREE_TYPE_HEAD *head, BTREE_KEYTYPE *key)
{
unsigned long _key = *key;
void *val = btree_get_prev(&head->h, BTREE_TYPE_GEO, &_key);
if (val)
*key = _key;
return val;
}
#else
static inline void *BTREE_FN(lookup)(BTREE_TYPE_HEAD *head, BTREE_KEYTYPE key)
{
return btree_lookup(&head->h, BTREE_TYPE_GEO, (unsigned long *)&key);
}
static inline int BTREE_FN(insert)(BTREE_TYPE_HEAD *head, BTREE_KEYTYPE key,
void *val, gfp_t gfp)
{
return btree_insert(&head->h, BTREE_TYPE_GEO, (unsigned long *)&key,
val, gfp);
}
static inline int BTREE_FN(update)(BTREE_TYPE_HEAD *head, BTREE_KEYTYPE key,
void *val)
{
return btree_update(&head->h, BTREE_TYPE_GEO, (unsigned long *)&key, val);
}
static inline void *BTREE_FN(remove)(BTREE_TYPE_HEAD *head, BTREE_KEYTYPE key)
{
return btree_remove(&head->h, BTREE_TYPE_GEO, (unsigned long *)&key);
}
static inline void *BTREE_FN(last)(BTREE_TYPE_HEAD *head, BTREE_KEYTYPE *key)
{
return btree_last(&head->h, BTREE_TYPE_GEO, (unsigned long *)key);
}
static inline void *BTREE_FN(get_prev)(BTREE_TYPE_HEAD *head, BTREE_KEYTYPE *key)
{
return btree_get_prev(&head->h, BTREE_TYPE_GEO, (unsigned long *)key);
}
#endif
void VISITOR_FN(void *elem, unsigned long opaque, unsigned long *key,
size_t index, void *__func);
typedef void (*VISITOR_FN_T)(void *elem, unsigned long opaque,
BTREE_KEYTYPE key, size_t index);
static inline size_t BTREE_FN(visitor)(BTREE_TYPE_HEAD *head,
unsigned long opaque,
VISITOR_FN_T func2)
{
return btree_visitor(&head->h, BTREE_TYPE_GEO, opaque,
visitorl, func2);
}
static inline size_t BTREE_FN(grim_visitor)(BTREE_TYPE_HEAD *head,
unsigned long opaque,
VISITOR_FN_T func2)
{
return btree_grim_visitor(&head->h, BTREE_TYPE_GEO, opaque,
visitorl, func2);
}
#undef VISITOR_FN
#undef VISITOR_FN_T
#undef __BTREE_TP
#undef _BTREE_TP
#undef BTREE_TP
#undef BTREE_FN
#undef BTREE_TYPE_HEAD
#undef BTREE_TYPE_SUFFIX
#undef BTREE_TYPE_GEO
#undef BTREE_KEYTYPE
#undef BTREE_TYPE_BITS

243
include/linux/btree.h Normal file
View File

@ -0,0 +1,243 @@
#ifndef BTREE_H
#define BTREE_H
#include <linux/kernel.h>
#include <linux/mempool.h>
/**
* DOC: B+Tree basics
*
* A B+Tree is a data structure for looking up arbitrary (currently allowing
* unsigned long, u32, u64 and 2 * u64) keys into pointers. The data structure
* is described at http://en.wikipedia.org/wiki/B-tree; we currently do not
* use binary search to find the key on lookups.
*
* Each B+Tree consists of a head, that contains bookkeeping information and
* a variable number (starting with zero) nodes. Each node contains the keys
* and pointers to sub-nodes, or, for leaf nodes, the keys and values for the
* tree entries.
*
* Each node in this implementation has the following layout:
* [key1, key2, ..., keyN] [val1, val2, ..., valN]
*
* Each key here is an array of unsigned longs, geo->no_longs in total. The
* number of keys and values (N) is geo->no_pairs.
*/
/**
* struct btree_head - btree head
*
* @node: the first node in the tree
* @mempool: mempool used for node allocations
* @height: current height of the tree
*/
struct btree_head {
unsigned long *node;
mempool_t *mempool;
int height;
};
/* btree geometry */
struct btree_geo;
/**
* btree_alloc - allocate function for the mempool
* @gfp_mask: gfp mask for the allocation
* @pool_data: unused
*/
void *btree_alloc(gfp_t gfp_mask, void *pool_data);
/**
* btree_free - free function for the mempool
* @element: the element to free
* @pool_data: unused
*/
void btree_free(void *element, void *pool_data);
/**
* btree_init_mempool - initialise a btree with given mempool
*
* @head: the btree head to initialise
* @mempool: the mempool to use
*
* When this function is used, there is no need to destroy
* the mempool.
*/
void btree_init_mempool(struct btree_head *head, mempool_t *mempool);
/**
* btree_init - initialise a btree
*
* @head: the btree head to initialise
*
* This function allocates the memory pool that the
* btree needs. Returns zero or a negative error code
* (-%ENOMEM) when memory allocation fails.
*
*/
int __must_check btree_init(struct btree_head *head);
/**
* btree_destroy - destroy mempool
*
* @head: the btree head to destroy
*
* This function destroys the internal memory pool, use only
* when using btree_init(), not with btree_init_mempool().
*/
void btree_destroy(struct btree_head *head);
/**
* btree_lookup - look up a key in the btree
*
* @head: the btree to look in
* @geo: the btree geometry
* @key: the key to look up
*
* This function returns the value for the given key, or %NULL.
*/
void *btree_lookup(struct btree_head *head, struct btree_geo *geo,
unsigned long *key);
/**
* btree_insert - insert an entry into the btree
*
* @head: the btree to add to
* @geo: the btree geometry
* @key: the key to add (must not already be present)
* @val: the value to add (must not be %NULL)
* @gfp: allocation flags for node allocations
*
* This function returns 0 if the item could be added, or an
* error code if it failed (may fail due to memory pressure).
*/
int __must_check btree_insert(struct btree_head *head, struct btree_geo *geo,
unsigned long *key, void *val, gfp_t gfp);
/**
* btree_update - update an entry in the btree
*
* @head: the btree to update
* @geo: the btree geometry
* @key: the key to update
* @val: the value to change it to (must not be %NULL)
*
* This function returns 0 if the update was successful, or
* -%ENOENT if the key could not be found.
*/
int btree_update(struct btree_head *head, struct btree_geo *geo,
unsigned long *key, void *val);
/**
* btree_remove - remove an entry from the btree
*
* @head: the btree to update
* @geo: the btree geometry
* @key: the key to remove
*
* This function returns the removed entry, or %NULL if the key
* could not be found.
*/
void *btree_remove(struct btree_head *head, struct btree_geo *geo,
unsigned long *key);
/**
* btree_merge - merge two btrees
*
* @target: the tree that gets all the entries
* @victim: the tree that gets merged into @target
* @geo: the btree geometry
* @gfp: allocation flags
*
* The two trees @target and @victim may not contain the same keys,
* that is a bug and triggers a BUG(). This function returns zero
* if the trees were merged successfully, and may return a failure
* when memory allocation fails, in which case both trees might have
* been partially merged, i.e. some entries have been moved from
* @victim to @target.
*/
int btree_merge(struct btree_head *target, struct btree_head *victim,
struct btree_geo *geo, gfp_t gfp);
/**
* btree_last - get last entry in btree
*
* @head: btree head
* @geo: btree geometry
* @key: last key
*
* Returns the last entry in the btree, and sets @key to the key
* of that entry; returns NULL if the tree is empty, in that case
* key is not changed.
*/
void *btree_last(struct btree_head *head, struct btree_geo *geo,
unsigned long *key);
/**
* btree_get_prev - get previous entry
*
* @head: btree head
* @geo: btree geometry
* @key: pointer to key
*
* The function returns the next item right before the value pointed to by
* @key, and updates @key with its key, or returns %NULL when there is no
* entry with a key smaller than the given key.
*/
void *btree_get_prev(struct btree_head *head, struct btree_geo *geo,
unsigned long *key);
/* internal use, use btree_visitor{l,32,64,128} */
size_t btree_visitor(struct btree_head *head, struct btree_geo *geo,
unsigned long opaque,
void (*func)(void *elem, unsigned long opaque,
unsigned long *key, size_t index,
void *func2),
void *func2);
/* internal use, use btree_grim_visitor{l,32,64,128} */
size_t btree_grim_visitor(struct btree_head *head, struct btree_geo *geo,
unsigned long opaque,
void (*func)(void *elem, unsigned long opaque,
unsigned long *key,
size_t index, void *func2),
void *func2);
#include <linux/btree-128.h>
extern struct btree_geo btree_geo32;
#define BTREE_TYPE_SUFFIX l
#define BTREE_TYPE_BITS BITS_PER_LONG
#define BTREE_TYPE_GEO &btree_geo32
#define BTREE_KEYTYPE unsigned long
#include <linux/btree-type.h>
#define btree_for_each_safel(head, key, val) \
for (val = btree_lastl(head, &key); \
val; \
val = btree_get_prevl(head, &key))
#define BTREE_TYPE_SUFFIX 32
#define BTREE_TYPE_BITS 32
#define BTREE_TYPE_GEO &btree_geo32
#define BTREE_KEYTYPE u32
#include <linux/btree-type.h>
#define btree_for_each_safe32(head, key, val) \
for (val = btree_last32(head, &key); \
val; \
val = btree_get_prev32(head, &key))
extern struct btree_geo btree_geo64;
#define BTREE_TYPE_SUFFIX 64
#define BTREE_TYPE_BITS 64
#define BTREE_TYPE_GEO &btree_geo64
#define BTREE_KEYTYPE u64
#include <linux/btree-type.h>
#define btree_for_each_safe64(head, key, val) \
for (val = btree_last64(head, &key); \
val; \
val = btree_get_prev64(head, &key))
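/*
 * Sketch of the "safe" iteration pattern for u64 keys (illustrative only,
 * not part of this header): the loop starts at the largest key and walks
 * downwards via btree_last64()/btree_get_prev64(), so removing the current
 * entry inside the body is fine.  The helper name is invented.
 */
static inline void btree64_drop_all(struct btree_head64 *head)
{
	u64 key;
	void *val;
	btree_for_each_safe64(head, key, val)
		btree_remove64(head, key);
}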
#endif

lib/Kconfig
View File

@ -163,6 +163,9 @@ config TEXTSEARCH_FSM
config LIST_SORT
boolean
config BTREE
boolean
config HAS_IOMEM
boolean
depends on !NO_IOMEM

lib/Makefile
View File

@ -42,6 +42,7 @@ obj-$(CONFIG_GENERIC_FIND_LAST_BIT) += find_last_bit.o
obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
obj-$(CONFIG_LIST_SORT) += list_sort.o
obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
obj-$(CONFIG_BTREE) += btree.o
obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
obj-$(CONFIG_DEBUG_LIST) += list_debug.o
obj-$(CONFIG_DEBUG_OBJECTS) += debugobjects.o

797
lib/btree.c Normal file
View File

@ -0,0 +1,797 @@
/*
* lib/btree.c - Simple In-memory B+Tree
*
* As should be obvious for Linux kernel code, license is GPLv2
*
* Copyright (c) 2007-2008 Joern Engel <joern@logfs.org>
* Bits and pieces stolen from Peter Zijlstra's code, which is
* Copyright 2007, Red Hat Inc. Peter Zijlstra <pzijlstr@redhat.com>
* GPLv2
*
* see http://programming.kicks-ass.net/kernel-patches/vma_lookup/btree.patch
*
* A relatively simple B+Tree implementation. I have written it as a learning
* exercise to understand how B+Trees work. Turned out to be useful as well.
*
* B+Trees can be used similarly to Linux radix trees (which don't have anything
* in common with textbook radix trees, beware). Prerequisite for them working
* well is that access to a random tree node is much faster than a large number
* of operations within each node.
*
* Disks have fulfilled the prerequisite for a long time. More recently DRAM
* has gained similar properties, as memory access times, when measured in cpu
* cycles, have increased. Cacheline sizes have increased as well, which also
* helps B+Trees.
*
* Compared to radix trees, B+Trees are more efficient when dealing with a
* sparsely populated address space. Between 25% and 50% of the memory is
* occupied with valid pointers. When densely populated, radix trees contain
* ~98% pointers - hard to beat. Very sparse radix trees contain only ~2%
* pointers.
*
* This particular implementation stores pointers identified by a long value.
* Storing NULL pointers is illegal, lookup will return NULL when no entry
* was found.
*
* A trick was used that is not commonly found in textbooks. The lowest
* values are to the right, not to the left. All used slots within a node
* are on the left, all unused slots contain NUL values. Most operations
* simply loop once over all slots and terminate on the first NUL.
*/
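/*
 * Illustration of that layout (not from the original source): with plain
 * unsigned long keys, a leaf holding the entries {2 -> a, 5 -> b, 9 -> c}
 * is stored as
 *	keys: [9, 5, 2, 0, 0, ...]	vals: [c, b, a, NULL, NULL, ...]
 * i.e. slot 0 carries the largest key and unused slots stay zero-filled.
 */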
#include <linux/btree.h>
#include <linux/cache.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/module.h>
#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define NODESIZE MAX(L1_CACHE_BYTES, 128)
struct btree_geo {
int keylen;
int no_pairs;
int no_longs;
};
struct btree_geo btree_geo32 = {
.keylen = 1,
.no_pairs = NODESIZE / sizeof(long) / 2,
.no_longs = NODESIZE / sizeof(long) / 2,
};
EXPORT_SYMBOL_GPL(btree_geo32);
#define LONG_PER_U64 (64 / BITS_PER_LONG)
struct btree_geo btree_geo64 = {
.keylen = LONG_PER_U64,
.no_pairs = NODESIZE / sizeof(long) / (1 + LONG_PER_U64),
.no_longs = LONG_PER_U64 * (NODESIZE / sizeof(long) / (1 + LONG_PER_U64)),
};
EXPORT_SYMBOL_GPL(btree_geo64);
struct btree_geo btree_geo128 = {
.keylen = 2 * LONG_PER_U64,
.no_pairs = NODESIZE / sizeof(long) / (1 + 2 * LONG_PER_U64),
.no_longs = 2 * LONG_PER_U64 * (NODESIZE / sizeof(long) / (1 + 2 * LONG_PER_U64)),
};
EXPORT_SYMBOL_GPL(btree_geo128);
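/*
 * Worked numbers (assuming the common case of L1_CACHE_BYTES <= 128 and
 * 64-bit longs): NODESIZE is 128 bytes, i.e. 16 longs per node.  That gives
 * btree_geo32 and btree_geo64 eight key/value pairs per node, and
 * btree_geo128 five pairs (two longs of key plus one long of value each,
 * 16 / 3 = 5).
 */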
static struct kmem_cache *btree_cachep;
void *btree_alloc(gfp_t gfp_mask, void *pool_data)
{
return kmem_cache_alloc(btree_cachep, gfp_mask);
}
EXPORT_SYMBOL_GPL(btree_alloc);
void btree_free(void *element, void *pool_data)
{
kmem_cache_free(btree_cachep, element);
}
EXPORT_SYMBOL_GPL(btree_free);
static unsigned long *btree_node_alloc(struct btree_head *head, gfp_t gfp)
{
unsigned long *node;
node = mempool_alloc(head->mempool, gfp);
memset(node, 0, NODESIZE);
return node;
}
static int longcmp(const unsigned long *l1, const unsigned long *l2, size_t n)
{
size_t i;
for (i = 0; i < n; i++) {
if (l1[i] < l2[i])
return -1;
if (l1[i] > l2[i])
return 1;
}
return 0;
}
static unsigned long *longcpy(unsigned long *dest, const unsigned long *src,
size_t n)
{
size_t i;
for (i = 0; i < n; i++)
dest[i] = src[i];
return dest;
}
static unsigned long *longset(unsigned long *s, unsigned long c, size_t n)
{
size_t i;
for (i = 0; i < n; i++)
s[i] = c;
return s;
}
static void dec_key(struct btree_geo *geo, unsigned long *key)
{
unsigned long val;
int i;
for (i = geo->keylen - 1; i >= 0; i--) {
val = key[i];
key[i] = val - 1;
if (val)
break;
}
}
static unsigned long *bkey(struct btree_geo *geo, unsigned long *node, int n)
{
return &node[n * geo->keylen];
}
static void *bval(struct btree_geo *geo, unsigned long *node, int n)
{
return (void *)node[geo->no_longs + n];
}
static void setkey(struct btree_geo *geo, unsigned long *node, int n,
unsigned long *key)
{
longcpy(bkey(geo, node, n), key, geo->keylen);
}
static void setval(struct btree_geo *geo, unsigned long *node, int n,
void *val)
{
node[geo->no_longs + n] = (unsigned long) val;
}
static void clearpair(struct btree_geo *geo, unsigned long *node, int n)
{
longset(bkey(geo, node, n), 0, geo->keylen);
node[geo->no_longs + n] = 0;
}
static inline void __btree_init(struct btree_head *head)
{
head->node = NULL;
head->height = 0;
}
void btree_init_mempool(struct btree_head *head, mempool_t *mempool)
{
__btree_init(head);
head->mempool = mempool;
}
EXPORT_SYMBOL_GPL(btree_init_mempool);
int btree_init(struct btree_head *head)
{
__btree_init(head);
head->mempool = mempool_create(0, btree_alloc, btree_free, NULL);
if (!head->mempool)
return -ENOMEM;
return 0;
}
EXPORT_SYMBOL_GPL(btree_init);
void btree_destroy(struct btree_head *head)
{
mempool_destroy(head->mempool);
head->mempool = NULL;
}
EXPORT_SYMBOL_GPL(btree_destroy);
void *btree_last(struct btree_head *head, struct btree_geo *geo,
unsigned long *key)
{
int height = head->height;
unsigned long *node = head->node;
if (height == 0)
return NULL;
for ( ; height > 1; height--)
node = bval(geo, node, 0);
longcpy(key, bkey(geo, node, 0), geo->keylen);
return bval(geo, node, 0);
}
EXPORT_SYMBOL_GPL(btree_last);
static int keycmp(struct btree_geo *geo, unsigned long *node, int pos,
unsigned long *key)
{
return longcmp(bkey(geo, node, pos), key, geo->keylen);
}
static int keyzero(struct btree_geo *geo, unsigned long *key)
{
int i;
for (i = 0; i < geo->keylen; i++)
if (key[i])
return 0;
return 1;
}
void *btree_lookup(struct btree_head *head, struct btree_geo *geo,
unsigned long *key)
{
int i, height = head->height;
unsigned long *node = head->node;
if (height == 0)
return NULL;
for ( ; height > 1; height--) {
for (i = 0; i < geo->no_pairs; i++)
if (keycmp(geo, node, i, key) <= 0)
break;
if (i == geo->no_pairs)
return NULL;
node = bval(geo, node, i);
if (!node)
return NULL;
}
if (!node)
return NULL;
for (i = 0; i < geo->no_pairs; i++)
if (keycmp(geo, node, i, key) == 0)
return bval(geo, node, i);
return NULL;
}
EXPORT_SYMBOL_GPL(btree_lookup);
int btree_update(struct btree_head *head, struct btree_geo *geo,
unsigned long *key, void *val)
{
int i, height = head->height;
unsigned long *node = head->node;
if (height == 0)
return -ENOENT;
for ( ; height > 1; height--) {
for (i = 0; i < geo->no_pairs; i++)
if (keycmp(geo, node, i, key) <= 0)
break;
if (i == geo->no_pairs)
return -ENOENT;
node = bval(geo, node, i);
if (!node)
return -ENOENT;
}
if (!node)
return -ENOENT;
for (i = 0; i < geo->no_pairs; i++)
if (keycmp(geo, node, i, key) == 0) {
setval(geo, node, i, val);
return 0;
}
return -ENOENT;
}
EXPORT_SYMBOL_GPL(btree_update);
/*
* Usually this function is quite similar to normal lookup. But the key of
* a parent node may be smaller than the smallest key of all its siblings.
* In such a case we cannot just return NULL, as we have only proven that no
* key smaller than __key, but larger than this parent key exists.
* So we set __key to the parent key and retry. We have to use the smallest
* such parent key, which is the last parent key we encountered.
*/
void *btree_get_prev(struct btree_head *head, struct btree_geo *geo,
unsigned long *__key)
{
int i, height;
unsigned long *node, *oldnode;
unsigned long *retry_key = NULL, key[geo->keylen];
if (keyzero(geo, __key))
return NULL;
if (head->height == 0)
return NULL;
retry:
longcpy(key, __key, geo->keylen);
dec_key(geo, key);
node = head->node;
for (height = head->height ; height > 1; height--) {
for (i = 0; i < geo->no_pairs; i++)
if (keycmp(geo, node, i, key) <= 0)
break;
if (i == geo->no_pairs)
goto miss;
oldnode = node;
node = bval(geo, node, i);
if (!node)
goto miss;
retry_key = bkey(geo, oldnode, i);
}
if (!node)
goto miss;
for (i = 0; i < geo->no_pairs; i++) {
if (keycmp(geo, node, i, key) <= 0) {
if (bval(geo, node, i)) {
longcpy(__key, bkey(geo, node, i), geo->keylen);
return bval(geo, node, i);
} else
goto miss;
}
}
miss:
if (retry_key) {
__key = retry_key;
retry_key = NULL;
goto retry;
}
return NULL;
}
static int getpos(struct btree_geo *geo, unsigned long *node,
unsigned long *key)
{
int i;
for (i = 0; i < geo->no_pairs; i++) {
if (keycmp(geo, node, i, key) <= 0)
break;
}
return i;
}
static int getfill(struct btree_geo *geo, unsigned long *node, int start)
{
int i;
for (i = start; i < geo->no_pairs; i++)
if (!bval(geo, node, i))
break;
return i;
}
/*
* locate the correct leaf node in the btree
*/
static unsigned long *find_level(struct btree_head *head, struct btree_geo *geo,
unsigned long *key, int level)
{
unsigned long *node = head->node;
int i, height;
for (height = head->height; height > level; height--) {
for (i = 0; i < geo->no_pairs; i++)
if (keycmp(geo, node, i, key) <= 0)
break;
if ((i == geo->no_pairs) || !bval(geo, node, i)) {
/* right-most key is too large, update it */
/* FIXME: If the right-most key on higher levels is
* always zero, this wouldn't be necessary. */
i--;
setkey(geo, node, i, key);
}
BUG_ON(i < 0);
node = bval(geo, node, i);
}
BUG_ON(!node);
return node;
}
static int btree_grow(struct btree_head *head, struct btree_geo *geo,
gfp_t gfp)
{
unsigned long *node;
int fill;
node = btree_node_alloc(head, gfp);
if (!node)
return -ENOMEM;
if (head->node) {
fill = getfill(geo, head->node, 0);
setkey(geo, node, 0, bkey(geo, head->node, fill - 1));
setval(geo, node, 0, head->node);
}
head->node = node;
head->height++;
return 0;
}
static void btree_shrink(struct btree_head *head, struct btree_geo *geo)
{
unsigned long *node;
int fill;
if (head->height <= 1)
return;
node = head->node;
fill = getfill(geo, node, 0);
BUG_ON(fill > 1);
head->node = bval(geo, node, 0);
head->height--;
mempool_free(node, head->mempool);
}
static int btree_insert_level(struct btree_head *head, struct btree_geo *geo,
unsigned long *key, void *val, int level,
gfp_t gfp)
{
unsigned long *node;
int i, pos, fill, err;
BUG_ON(!val);
if (head->height < level) {
err = btree_grow(head, geo, gfp);
if (err)
return err;
}
retry:
node = find_level(head, geo, key, level);
pos = getpos(geo, node, key);
fill = getfill(geo, node, pos);
/* two identical keys are not allowed */
BUG_ON(pos < fill && keycmp(geo, node, pos, key) == 0);
if (fill == geo->no_pairs) {
/* need to split node */
unsigned long *new;
new = btree_node_alloc(head, gfp);
if (!new)
return -ENOMEM;
err = btree_insert_level(head, geo,
bkey(geo, node, fill / 2 - 1),
new, level + 1, gfp);
if (err) {
mempool_free(new, head->mempool);
return err;
}
for (i = 0; i < fill / 2; i++) {
setkey(geo, new, i, bkey(geo, node, i));
setval(geo, new, i, bval(geo, node, i));
setkey(geo, node, i, bkey(geo, node, i + fill / 2));
setval(geo, node, i, bval(geo, node, i + fill / 2));
clearpair(geo, node, i + fill / 2);
}
if (fill & 1) {
setkey(geo, node, i, bkey(geo, node, fill - 1));
setval(geo, node, i, bval(geo, node, fill - 1));
clearpair(geo, node, fill - 1);
}
goto retry;
}
BUG_ON(fill >= geo->no_pairs);
/* shift and insert */
for (i = fill; i > pos; i--) {
setkey(geo, node, i, bkey(geo, node, i - 1));
setval(geo, node, i, bval(geo, node, i - 1));
}
setkey(geo, node, pos, key);
setval(geo, node, pos, val);
return 0;
}
int btree_insert(struct btree_head *head, struct btree_geo *geo,
unsigned long *key, void *val, gfp_t gfp)
{
return btree_insert_level(head, geo, key, val, 1, gfp);
}
EXPORT_SYMBOL_GPL(btree_insert);
static void *btree_remove_level(struct btree_head *head, struct btree_geo *geo,
unsigned long *key, int level);
static void merge(struct btree_head *head, struct btree_geo *geo, int level,
unsigned long *left, int lfill,
unsigned long *right, int rfill,
unsigned long *parent, int lpos)
{
int i;
for (i = 0; i < rfill; i++) {
/* Move all keys to the left */
setkey(geo, left, lfill + i, bkey(geo, right, i));
setval(geo, left, lfill + i, bval(geo, right, i));
}
/* Exchange left and right child in parent */
setval(geo, parent, lpos, right);
setval(geo, parent, lpos + 1, left);
/* Remove left (formerly right) child from parent */
btree_remove_level(head, geo, bkey(geo, parent, lpos), level + 1);
mempool_free(right, head->mempool);
}
static void rebalance(struct btree_head *head, struct btree_geo *geo,
unsigned long *key, int level, unsigned long *child, int fill)
{
unsigned long *parent, *left = NULL, *right = NULL;
int i, no_left, no_right;
if (fill == 0) {
/* Because we don't steal entries from a neighbour, this case
* can happen. Parent node contains a single child, this
* node, so merging with a sibling never happens.
*/
btree_remove_level(head, geo, key, level + 1);
mempool_free(child, head->mempool);
return;
}
parent = find_level(head, geo, key, level + 1);
i = getpos(geo, parent, key);
BUG_ON(bval(geo, parent, i) != child);
if (i > 0) {
left = bval(geo, parent, i - 1);
no_left = getfill(geo, left, 0);
if (fill + no_left <= geo->no_pairs) {
merge(head, geo, level,
left, no_left,
child, fill,
parent, i - 1);
return;
}
}
if (i + 1 < getfill(geo, parent, i)) {
right = bval(geo, parent, i + 1);
no_right = getfill(geo, right, 0);
if (fill + no_right <= geo->no_pairs) {
merge(head, geo, level,
child, fill,
right, no_right,
parent, i);
return;
}
}
/*
* We could also try to steal one entry from the left or right
* neighbor. By not doing so we changed the invariant from
* "all nodes are at least half full" to "no two neighboring
* nodes can be merged". Which means that the average fill of
* all nodes is still half or better.
*/
}
static void *btree_remove_level(struct btree_head *head, struct btree_geo *geo,
unsigned long *key, int level)
{
unsigned long *node;
int i, pos, fill;
void *ret;
if (level > head->height) {
/* we recursed all the way up */
head->height = 0;
head->node = NULL;
return NULL;
}
node = find_level(head, geo, key, level);
pos = getpos(geo, node, key);
fill = getfill(geo, node, pos);
if ((level == 1) && (keycmp(geo, node, pos, key) != 0))
return NULL;
ret = bval(geo, node, pos);
/* remove and shift */
for (i = pos; i < fill - 1; i++) {
setkey(geo, node, i, bkey(geo, node, i + 1));
setval(geo, node, i, bval(geo, node, i + 1));
}
clearpair(geo, node, fill - 1);
if (fill - 1 < geo->no_pairs / 2) {
if (level < head->height)
rebalance(head, geo, key, level, node, fill - 1);
else if (fill - 1 == 1)
btree_shrink(head, geo);
}
return ret;
}
void *btree_remove(struct btree_head *head, struct btree_geo *geo,
unsigned long *key)
{
if (head->height == 0)
return NULL;
return btree_remove_level(head, geo, key, 1);
}
EXPORT_SYMBOL_GPL(btree_remove);
int btree_merge(struct btree_head *target, struct btree_head *victim,
struct btree_geo *geo, gfp_t gfp)
{
unsigned long key[geo->keylen];
unsigned long dup[geo->keylen];
void *val;
int err;
BUG_ON(target == victim);
if (!(target->node)) {
/* target is empty, just copy fields over */
target->node = victim->node;
target->height = victim->height;
__btree_init(victim);
return 0;
}
/* TODO: This needs some optimizations. Currently we do three tree
* walks to remove a single object from the victim.
*/
for (;;) {
if (!btree_last(victim, geo, key))
break;
val = btree_lookup(victim, geo, key);
err = btree_insert(target, geo, key, val, gfp);
if (err)
return err;
/* We must make a copy of the key, as the original will get
* mangled inside btree_remove. */
longcpy(dup, key, geo->keylen);
btree_remove(victim, geo, dup);
}
return 0;
}
EXPORT_SYMBOL_GPL(btree_merge);
static size_t __btree_for_each(struct btree_head *head, struct btree_geo *geo,
unsigned long *node, unsigned long opaque,
void (*func)(void *elem, unsigned long opaque,
unsigned long *key, size_t index,
void *func2),
void *func2, int reap, int height, size_t count)
{
int i;
unsigned long *child;
for (i = 0; i < geo->no_pairs; i++) {
child = bval(geo, node, i);
if (!child)
break;
if (height > 1)
count = __btree_for_each(head, geo, child, opaque,
func, func2, reap, height - 1, count);
else
func(child, opaque, bkey(geo, node, i), count++,
func2);
}
if (reap)
mempool_free(node, head->mempool);
return count;
}
static void empty(void *elem, unsigned long opaque, unsigned long *key,
size_t index, void *func2)
{
}
void visitorl(void *elem, unsigned long opaque, unsigned long *key,
size_t index, void *__func)
{
visitorl_t func = __func;
func(elem, opaque, *key, index);
}
EXPORT_SYMBOL_GPL(visitorl);
void visitor32(void *elem, unsigned long opaque, unsigned long *__key,
size_t index, void *__func)
{
visitor32_t func = __func;
u32 *key = (void *)__key;
func(elem, opaque, *key, index);
}
EXPORT_SYMBOL_GPL(visitor32);
void visitor64(void *elem, unsigned long opaque, unsigned long *__key,
size_t index, void *__func)
{
visitor64_t func = __func;
u64 *key = (void *)__key;
func(elem, opaque, *key, index);
}
EXPORT_SYMBOL_GPL(visitor64);
void visitor128(void *elem, unsigned long opaque, unsigned long *__key,
size_t index, void *__func)
{
visitor128_t func = __func;
u64 *key = (void *)__key;
func(elem, opaque, key[0], key[1], index);
}
EXPORT_SYMBOL_GPL(visitor128);
size_t btree_visitor(struct btree_head *head, struct btree_geo *geo,
unsigned long opaque,
void (*func)(void *elem, unsigned long opaque,
unsigned long *key,
size_t index, void *func2),
void *func2)
{
size_t count = 0;
if (!func2)
func = empty;
if (head->node)
count = __btree_for_each(head, geo, head->node, opaque, func,
func2, 0, head->height, 0);
return count;
}
EXPORT_SYMBOL_GPL(btree_visitor);
size_t btree_grim_visitor(struct btree_head *head, struct btree_geo *geo,
unsigned long opaque,
void (*func)(void *elem, unsigned long opaque,
unsigned long *key,
size_t index, void *func2),
void *func2)
{
size_t count = 0;
if (!func2)
func = empty;
if (head->node)
count = __btree_for_each(head, geo, head->node, opaque, func,
func2, 1, head->height, 0);
__btree_init(head);
return count;
}
EXPORT_SYMBOL_GPL(btree_grim_visitor);
static int __init btree_module_init(void)
{
btree_cachep = kmem_cache_create("btree_node", NODESIZE, 0,
SLAB_HWCACHE_ALIGN, NULL);
return 0;
}
static void __exit btree_module_exit(void)
{
kmem_cache_destroy(btree_cachep);
}
/* If core code starts using btree, initialization should happen even earlier */
module_init(btree_module_init);
module_exit(btree_module_exit);
MODULE_AUTHOR("Joern Engel <joern@logfs.org>");
MODULE_AUTHOR("Johannes Berg <johannes@sipsolutions.net>");
MODULE_LICENSE("GPL");