docs: filesystems: vfs: Use 72 character column width

In preparation for conversion to RST format use the kernels favoured
documentation column width. If we are going to do this we might as well
do it thoroughly. Just do the paragraphs (not the indented stuff), the
rest will be done during indentation fix up patch.

This patch is whitespace only, no textual changes.

Use 72 character column width for all paragraph sections.

Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Tobin C. Harding <tobin@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

parent 4ee33ea403
commit 90caa781f6
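
The commit message states that the patch is whitespace only. One way to check such a claim for a pair of old and new paragraphs is to compare them word by word while ignoring how spaces and newlines are distributed. The following is a minimal sketch in C, not part of the kernel tree; the names `skip_ws` and `same_words` are hypothetical and chosen for illustration only.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>

/* Advance past any run of whitespace (spaces, tabs, newlines). */
const char *skip_ws(const char *p)
{
	while (*p && isspace((unsigned char)*p))
		p++;
	return p;
}

/*
 * Hypothetical helper: return true when a and b contain the same
 * sequence of whitespace-separated words, so that the only possible
 * difference between them is line wrapping or spacing.
 */
bool same_words(const char *a, const char *b)
{
	a = skip_ws(a);
	b = skip_ws(b);

	while (*a && *b) {
		/* Compare one word character by character. */
		while (*a && *b &&
		       !isspace((unsigned char)*a) &&
		       !isspace((unsigned char)*b)) {
			if (*a != *b)
				return false;
			a++;
			b++;
		}

		/* Both inputs must reach a word boundary together. */
		bool a_end = (*a == '\0' || isspace((unsigned char)*a));
		bool b_end = (*b == '\0' || isspace((unsigned char)*b));
		if (!a_end || !b_end)
			return false;

		a = skip_ws(a);
		b = skip_ws(b);
	}

	/* Equal only if neither side has words left over. */
	return *a == '\0' && *b == '\0';
}
```

Passing the removed lines of a hunk (joined with newlines) as the first argument and the added lines as the second should return true for every hunk of a pure reflow. Note that this check deliberately treats a change in spacing between words as no change at all, which matches the stated intent of this patch but would hide whitespace changes that matter inside indented code.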
@@ -12,15 +12,14 @@
 Introduction
 ============

-The Virtual File System (also known as the Virtual Filesystem Switch)
-is the software layer in the kernel that provides the filesystem
-interface to userspace programs. It also provides an abstraction
-within the kernel which allows different filesystem implementations to
-coexist.
+The Virtual File System (also known as the Virtual Filesystem Switch) is
+the software layer in the kernel that provides the filesystem interface
+to userspace programs. It also provides an abstraction within the
+kernel which allows different filesystem implementations to coexist.

-VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so
-on are called from a process context. Filesystem locking is described
-in the document Documentation/filesystems/Locking.
+VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so on
+are called from a process context. Filesystem locking is described in
+the document Documentation/filesystems/Locking.


 Directory Entry Cache (dcache)
@@ -34,11 +33,10 @@ translate a pathname (filename) into a specific dentry. Dentries live
 in RAM and are never saved to disc: they exist only for performance.

 The dentry cache is meant to be a view into your entire filespace. As
-most computers cannot fit all dentries in the RAM at the same time,
-some bits of the cache are missing. In order to resolve your pathname
-into a dentry, the VFS may have to resort to creating dentries along
-the way, and then loading the inode. This is done by looking up the
-inode.
+most computers cannot fit all dentries in the RAM at the same time, some
+bits of the cache are missing. In order to resolve your pathname into a
+dentry, the VFS may have to resort to creating dentries along the way,
+and then loading the inode. This is done by looking up the inode.


 The Inode Object
@@ -46,33 +44,32 @@ The Inode Object

 An individual dentry usually has a pointer to an inode. Inodes are
 filesystem objects such as regular files, directories, FIFOs and other
-beasts. They live either on the disc (for block device filesystems)
-or in the memory (for pseudo filesystems). Inodes that live on the
-disc are copied into the memory when required and changes to the inode
-are written back to disc. A single inode can be pointed to by multiple
+beasts. They live either on the disc (for block device filesystems) or
+in the memory (for pseudo filesystems). Inodes that live on the disc
+are copied into the memory when required and changes to the inode are
+written back to disc. A single inode can be pointed to by multiple
 dentries (hard links, for example, do this).

 To look up an inode requires that the VFS calls the lookup() method of
 the parent directory inode. This method is installed by the specific
-filesystem implementation that the inode lives in. Once the VFS has
-the required dentry (and hence the inode), we can do all those boring
-things like open(2) the file, or stat(2) it to peek at the inode
-data. The stat(2) operation is fairly simple: once the VFS has the
-dentry, it peeks at the inode data and passes some of it back to
-userspace.
+filesystem implementation that the inode lives in. Once the VFS has the
+required dentry (and hence the inode), we can do all those boring things
+like open(2) the file, or stat(2) it to peek at the inode data. The
+stat(2) operation is fairly simple: once the VFS has the dentry, it
+peeks at the inode data and passes some of it back to userspace.


 The File Object
 ---------------

 Opening a file requires another operation: allocation of a file
-structure (this is the kernel-side implementation of file
-descriptors). The freshly allocated file structure is initialized with
-a pointer to the dentry and a set of file operation member functions.
-These are taken from the inode data. The open() file method is then
-called so the specific filesystem implementation can do its work. You
-can see that this is another switch performed by the VFS. The file
-structure is placed into the file descriptor table for the process.
+structure (this is the kernel-side implementation of file descriptors).
+The freshly allocated file structure is initialized with a pointer to
+the dentry and a set of file operation member functions. These are
+taken from the inode data. The open() file method is then called so the
+specific filesystem implementation can do its work. You can see that
+this is another switch performed by the VFS. The file structure is
+placed into the file descriptor table for the process.

 Reading, writing and closing files (and other assorted VFS operations)
 is done by using the userspace file descriptor to grab the appropriate
@@ -93,11 +90,12 @@ functions:
 extern int unregister_filesystem(struct file_system_type *);

 The passed struct file_system_type describes your filesystem. When a
-request is made to mount a filesystem onto a directory in your namespace,
-the VFS will call the appropriate mount() method for the specific
-filesystem. New vfsmount referring to the tree returned by ->mount()
-will be attached to the mountpoint, so that when pathname resolution
-reaches the mountpoint it will jump into the root of that vfsmount.
+request is made to mount a filesystem onto a directory in your
+namespace, the VFS will call the appropriate mount() method for the
+specific filesystem. New vfsmount referring to the tree returned by
+->mount() will be attached to the mountpoint, so that when pathname
+resolution reaches the mountpoint it will jump into the root of that
+vfsmount.

 You can see all filesystems that are registered to the kernel in the
 file /proc/filesystems.
@@ -156,21 +154,21 @@ The mount() method must return the root dentry of the tree requested by
 caller. An active reference to its superblock must be grabbed and the
 superblock must be locked. On failure it should return ERR_PTR(error).

-The arguments match those of mount(2) and their interpretation
-depends on filesystem type. E.g. for block filesystems, dev_name is
-interpreted as block device name, that device is opened and if it
-contains a suitable filesystem image the method creates and initializes
-struct super_block accordingly, returning its root dentry to caller.
+The arguments match those of mount(2) and their interpretation depends
+on filesystem type. E.g. for block filesystems, dev_name is interpreted
+as block device name, that device is opened and if it contains a
+suitable filesystem image the method creates and initializes struct
+super_block accordingly, returning its root dentry to caller.

 ->mount() may choose to return a subtree of existing filesystem - it
 doesn't have to create a new one. The main result from the caller's
-point of view is a reference to dentry at the root of (sub)tree to
-be attached; creation of new superblock is a common side effect.
+point of view is a reference to dentry at the root of (sub)tree to be
+attached; creation of new superblock is a common side effect.

-The most interesting member of the superblock structure that the
-mount() method fills in is the "s_op" field. This is a pointer to
-a "struct super_operations" which describes the next level of the
-filesystem implementation.
+The most interesting member of the superblock structure that the mount()
+method fills in is the "s_op" field. This is a pointer to a "struct
+super_operations" which describes the next level of the filesystem
+implementation.

 Usually, a filesystem uses one of the generic mount() implementations
 and provides a fill_super() callback instead. The generic variants are:
@@ -317,16 +315,16 @@ or bottom half).
 implementations will cause holdoff problems due to large scan batch
 sizes.

-Whoever sets up the inode is responsible for filling in the "i_op" field. This
-is a pointer to a "struct inode_operations" which describes the methods that
-can be performed on individual inodes.
+Whoever sets up the inode is responsible for filling in the "i_op"
+field. This is a pointer to a "struct inode_operations" which describes
+the methods that can be performed on individual inodes.

 struct xattr_handlers
 ---------------------

 On filesystems that support extended attributes (xattrs), the s_xattr
-superblock field points to a NULL-terminated array of xattr handlers. Extended
-attributes are name:value pairs.
+superblock field points to a NULL-terminated array of xattr handlers.
+Extended attributes are name:value pairs.

 name: Indicates that the handler matches attributes with the specified name
 (such as "system.posix_acl_access"); the prefix field must be NULL.
@@ -346,9 +344,9 @@ attributes are name:value pairs.
 attribute. This method is called by the the setxattr(2) and
 removexattr(2) system calls.

-When none of the xattr handlers of a filesystem match the specified attribute
-name or when a filesystem doesn't support extended attributes, the various
-*xattr(2) system calls return -EOPNOTSUPP.
+When none of the xattr handlers of a filesystem match the specified
+attribute name or when a filesystem doesn't support extended attributes,
+the various *xattr(2) system calls return -EOPNOTSUPP.


 The Inode Object
@@ -360,8 +358,8 @@ An inode object represents an object within the filesystem.
 struct inode_operations
 -----------------------

-This describes how the VFS can manipulate an inode in your
-filesystem. As of kernel 2.6.22, the following members are defined:
+This describes how the VFS can manipulate an inode in your filesystem.
+As of kernel 2.6.22, the following members are defined:

 struct inode_operations {
 int (*create) (struct inode *,struct dentry *, umode_t, bool);
@@ -517,42 +515,40 @@ The Address Space Object
 ========================

 The address space object is used to group and manage pages in the page
-cache. It can be used to keep track of the pages in a file (or
-anything else) and also track the mapping of sections of the file into
-process address spaces.
+cache. It can be used to keep track of the pages in a file (or anything
+else) and also track the mapping of sections of the file into process
+address spaces.

 There are a number of distinct yet related services that an
-address-space can provide. These include communicating memory
-pressure, page lookup by address, and keeping track of pages tagged as
-Dirty or Writeback.
+address-space can provide. These include communicating memory pressure,
+page lookup by address, and keeping track of pages tagged as Dirty or
+Writeback.

 The first can be used independently to the others. The VM can try to
-either write dirty pages in order to clean them, or release clean
-pages in order to reuse them. To do this it can call the ->writepage
-method on dirty pages, and ->releasepage on clean pages with
-PagePrivate set. Clean pages without PagePrivate and with no external
-references will be released without notice being given to the
-address_space.
+either write dirty pages in order to clean them, or release clean pages
+in order to reuse them. To do this it can call the ->writepage method
+on dirty pages, and ->releasepage on clean pages with PagePrivate set.
+Clean pages without PagePrivate and with no external references will be
+released without notice being given to the address_space.

 To achieve this functionality, pages need to be placed on an LRU with
-lru_cache_add and mark_page_active needs to be called whenever the
-page is used.
+lru_cache_add and mark_page_active needs to be called whenever the page
+is used.

 Pages are normally kept in a radix tree index by ->index. This tree
-maintains information about the PG_Dirty and PG_Writeback status of
-each page, so that pages with either of these flags can be found
-quickly.
+maintains information about the PG_Dirty and PG_Writeback status of each
+page, so that pages with either of these flags can be found quickly.

 The Dirty tag is primarily used by mpage_writepages - the default
 ->writepages method. It uses the tag to find dirty pages to call
 ->writepage on. If mpage_writepages is not used (i.e. the address
-provides its own ->writepages) , the PAGECACHE_TAG_DIRTY tag is
-almost unused. write_inode_now and sync_inode do use it (through
+provides its own ->writepages) , the PAGECACHE_TAG_DIRTY tag is almost
+unused. write_inode_now and sync_inode do use it (through
 __sync_single_inode) to check if ->writepages has been successful in
 writing out the whole address_space.

-The Writeback tag is used by filemap*wait* and sync_page* functions,
-via filemap_fdatawait_range, to wait for all writeback to complete.
+The Writeback tag is used by filemap*wait* and sync_page* functions, via
+filemap_fdatawait_range, to wait for all writeback to complete.

 An address_space handler may attach extra information to a page,
 typically using the 'private' field in the 'struct page'. If such
@@ -562,25 +558,24 @@ handler to deal with that data.

 An address space acts as an intermediate between storage and
 application. Data is read into the address space a whole page at a
-time, and provided to the application either by copying of the page,
-or by memory-mapping the page.
-Data is written into the address space by the application, and then
-written-back to storage typically in whole pages, however the
-address_space has finer control of write sizes.
+time, and provided to the application either by copying of the page, or
+by memory-mapping the page. Data is written into the address space by
+the application, and then written-back to storage typically in whole
+pages, however the address_space has finer control of write sizes.

 The read process essentially only requires 'readpage'. The write
 process is more complicated and uses write_begin/write_end or
-set_page_dirty to write data into the address_space, and writepage
-and writepages to writeback data to storage.
+set_page_dirty to write data into the address_space, and writepage and
+writepages to writeback data to storage.

 Adding and removing pages to/from an address_space is protected by the
 inode's i_mutex.

 When data is written to a page, the PG_Dirty flag should be set. It
 typically remains set until writepage asks for it to be written. This
-should clear PG_Dirty and set PG_Writeback. It can be actually
-written at any point after PG_Dirty is clear. Once it is known to be
-safe, PG_Writeback is cleared.
+should clear PG_Dirty and set PG_Writeback. It can be actually written
+at any point after PG_Dirty is clear. Once it is known to be safe,
+PG_Writeback is cleared.

 Writeback makes use of a writeback_control structure to direct the
 operations. This gives the the writepage and writepages operations some
@@ -609,9 +604,10 @@ file descriptors should get back an error is not possible.
 Instead, the generic writeback error tracking infrastructure in the
 kernel settles for reporting errors to fsync on all file descriptions
 that were open at the time that the error occurred. In a situation with
-multiple writers, all of them will get back an error on a subsequent fsync,
-even if all of the writes done through that particular file descriptor
-succeeded (or even if there were no writes on that file descriptor at all).
+multiple writers, all of them will get back an error on a subsequent
+fsync, even if all of the writes done through that particular file
+descriptor succeeded (or even if there were no writes on that file
+descriptor at all).

 Filesystems that wish to use this infrastructure should call
 mapping_set_error to record the error in the address_space when it
@@ -623,8 +619,8 @@ point in the stream of errors emitted by the backing device(s).
 struct address_space_operations
 -------------------------------

-This describes how the VFS can manipulate mapping of a file to page cache in
-your filesystem. The following members are defined:
+This describes how the VFS can manipulate mapping of a file to page
+cache in your filesystem. The following members are defined:

 struct address_space_operations {
 int (*writepage)(struct page *page, struct writeback_control *wbc);
@@ -1231,8 +1227,8 @@ filesystems.
 Showing options
 ---------------

-If a filesystem accepts mount options, it must define show_options()
-to show all the currently active options. The rules are:
+If a filesystem accepts mount options, it must define show_options() to
+show all the currently active options. The rules are:

 - options MUST be shown which are not default or their values differ
 from the default
@@ -1240,14 +1236,14 @@ to show all the currently active options. The rules are:
 - options MAY be shown which are enabled by default or have their
 default value

-Options used only internally between a mount helper and the kernel
-(such as file descriptors), or which only have an effect during the
-mounting (such as ones controlling the creation of a journal) are exempt
-from the above rules.
+Options used only internally between a mount helper and the kernel (such
+as file descriptors), or which only have an effect during the mounting
+(such as ones controlling the creation of a journal) are exempt from the
+above rules.

-The underlying reason for the above rules is to make sure, that a
-mount can be accurately replicated (e.g. umounting and mounting again)
-based on the information found in /proc/mounts.
+The underlying reason for the above rules is to make sure, that a mount
+can be accurately replicated (e.g. umounting and mounting again) based
+on the information found in /proc/mounts.

 Resources
 =========
