Userspace can now query and specify the cluster stack in use via the
/sys/fs/ocfs2/cluster_stack file. By default, it is 'o2cb', which is
the classic stack. Thus, old tools that do not know how to modify this
file will work just fine. The stack cannot be modified if there is a
live filesystem.
ocfs2_cluster_connect() now takes the expected cluster stack as an
argument. This way, the filesystem and the stack glue ensure they are
speaking to the same backend.
If the stack is 'o2cb', the o2cb stack plugin is used. For any other
value, the fsdlm stack plugin is selected.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
We define the ocfs2_stack_plugin structure to represent a stack driver.
The o2cb stack code is split into stack_o2cb.c. This becomes the
ocfs2_stack_o2cb.ko module.
The stackglue generic functions are similarly split into the
ocfs2_stackglue.ko module. This module now provides an interface to
register drivers. The ocfs2_stack_o2cb driver registers itself. As
part of this interface, ocfs2_stackglue can load drivers on demand.
This is accomplished in ocfs2_cluster_connect().
ocfs2_cluster_disconnect() is now notified when a _hangup() is pending.
If a hangup is pending, it will not release the driver module and will
let _hangup() do that.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The stack glue initialization function needs a better name so that it can be
used cleanly when stackglue becomes a module.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
dlmglue.c was still referencing a raw o2dlm lksb in one instance. Let's
create a generic ocfs2_dlm_dump_lksb() function. This allows underlying
DLMs to print whatever they want about their lock.
We then move the o2dlm dump into stackglue.c where it belongs.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
When using fsdlm, -EAGAIN is returned in the async callback for NOQUEUE
requests. Fix up dlmglue to expect this.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
o2dlm has the non-standard behavior of providing a cancel callback
(unlock_ast) even when the cancel has failed (the locking operation
succeeded without canceling). This is called CANCELGRANT after the
status code sent to the callback. fs/dlm does not provide this
callback, so dlmglue must be changed to live without it.
o2dlm_unlock_ast_wrapper() in stackglue now ignores CANCELGRANT calls.
Because dlmglue no longer sees CANCELGRANT, ocfs2_unlock_ast() no longer
needs to check for it. ocfs2_locking_ast() must catch that a cancel was
tried and clear the cancel state.
Making these changes opens up a locking race. dlmglue uses the the
OCFS2_LOCK_BUSY flag to ensure only one thread is calling the dlm at any
one time. But dlmglue must unlock the lockres before calling into the
dlm. In the small window of time between unlocking the lockres and
calling the dlm, the downconvert thread can try to cancel the lock. The
downconvert thread is checking the OCFS2_LOCK_BUSY flag - it doesn't
know that ocfs2_dlm_lock() has not yet been called.
Because ocfs2_dlm_lock() has not yet been called, the cancel operation
will just be a no-op. There's nothing to cancel. With CANCELGRANT,
dlmglue uses the CANCELGRANT callback to clear up the cancel state.
When it comes around again, it will retry the cancel. Eventually, the
first thread will have called into ocfs2_dlm_lock(), and either the
lock or the cancel will succeed. The downconvert thread can then do its
downconvert.
Without CANCELGRANT, there is nothing to clean up the cancellation
state. The downconvert thread does not know to retry its operations.
More importantly, the original lock may be blocking on the other node
that is trying to cancel us. With neither able to make progress, the
ast is never called and the cancellation state is never cleaned up that
way. dlmglue is deadlocked.
The OCFS2_LOCK_PENDING flag is introduced to remedy this window. It is
set at the same time OCFS2_LOCK_BUSY is. Thus, the downconvert thread
can check whether the lock is cancelable. If not, it just loops around
to try again. Once ocfs2_dlm_lock() is called, the thread then clears
OCFS2_LOCK_PENDING and wakes the downconvert thread. Now, if the
downconvert thread finds the lock BUSY, it can safely try to cancel it.
Whether the cancel works or not, the state will be properly set and the
lock processing can continue.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
It doesn't make sense to query for a node number before connecting to the
cluster stack. This should be safe to do because node_num is only just
printed,
and we're actually only moving the setting of node num a small amount
further in the mount process.
[ Disconnect when node query fails -- Joel ]
Reviewed-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The last bit of classic stack used directly in ocfs2 code is o2hb.
Specifically, the check for heartbeat during mount and the call to
ocfs2_hb_ctl during unmount.
We create an extra API, ocfs2_cluster_hangup(), to encapsulate the call
to ocfs2_hb_ctl. Other stacks will just leave hangup() empty.
The check for heartbeat is moved into ocfs2_cluster_connect(). It will
be matched by a similar check for other stacks.
With this change, only stackglue.c includes cluster/ headers.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This step introduces a cluster stack agnostic API for initializing and
exiting. fs/ocfs2/dlmglue.c no longer uses o2cb/o2dlm knowledge to
connect to the stack. It is all handled in stackglue.c.
heartbeat.c no longer needs to know how it gets called.
ocfs2_do_node_down() is now a clean recovery trigger.
The big gotcha is the ordering of initializations and de-initializations done
underneath ocfs2_cluster_connect(). ocfs2_dlm_init() used to do all
o2dlm initialization in one block. Thus, the o2dlm functionality of
ocfs2_cluster_connect() is very straightforward. ocfs2_dlm_shutdown(),
however, did a few things between de-registration of the eviction
callback and actually shutting down the domain. Now de-registration and
shutdown of the domain are wrapped within the single
ocfs2_cluster_disconnect() call. I've checked the code paths to make
sure we can safely tear down things in ocfs2_dlm_shutdown() before
calling ocfs2_cluster_disconnect(). The filesystem has already set
itself to ignore the callback.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Wrap the lock status block (lksb) in a union. Later we will add a union
element for the fs/dlm lksb. Create accessors for the status and lvb
fields.
Other than a debugging function, dlmglue.c does not directly reference
the o2dlm locking path anymore.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Change the ocfs2_dlm_lock/unlock() functions to return -errno values.
This is the first step towards elminiating dlm_status in
fs/ocfs2/dlmglue.c. The change also passes -errno values to
->unlock_ast().
[ Fix a return code in dlmglue.c and change the error translation table into
an array of ints. --Mark ]
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The ocfs2 generic code should use the values in <linux/dlmconstants.h>.
stackglue.c will convert them to o2dlm values.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This is the first in a series of patches to isolate ocfs2 from the
underlying cluster stack. Here we wrap the dlm locking functions with
ocfs2-specific calls. Because ocfs2 always uses the same dlm lock status
callbacks, we can eliminate the callbacks from the filesystem visible
functions.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The old recovery map was a bitmap of node numbers. This was sufficient
for the maximum node number of 254. Going forward, we want node numbers
to be UINT32. Thus, we need a new recovery map.
Note that we can't keep track of slots here. We must write down the
node number to recovery *before* we get the locks needed to convert a
node number into a slot number.
The recovery map is now an array of unsigned ints, max_slots in size.
It moves to journal.c with the rest of recovery.
Because it needs to be initialized, we move all of recovery initialization
into a new function, ocfs2_recovery_init(). This actually cleans up
ocfs2_initialize_super() a little as well. Following on, recovery cleaup
becomes part of ocfs2_recovery_exit().
A number of node map functions are rendered obsolete and are removed.
Finally, waiting on recovery is wrapped in a function rather than naked
checks on the recovery_event. This is a cleanup from Mark.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
journal.c and dlmglue.c would refresh the slot map by hand. Instead, have
the update and clear functions do the work inside slot_map.c. The eventual
result is to make ocfs2_slot_info defined privately in slot_map.c
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch makes the needlessly global ocfs2_downconvert_thread()
static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
This patch contains the following cleanups that are now possible:
- make the following needlessly global functions static:
- dlmglue.c:ocfs2_process_blocked_lock()
- heartbeat.c:ocfs2_node_map_init()
- #if 0 the following unused global function plus support functions:
- heartbeat.c:ocfs2_node_map_is_only()
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Commit f1f540688e "optimized"
ocfs2_data_convert_worker() to "only do work for regular files".
Unfortunately, I left out a '!', which casued it to *skip* regular files.
This was hidden from testing until recently because the default data
journaling mode (data=ordered) doesn't exercise this code.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Currently, when ocfs2 nodes connect via TCP, they advertise their
compatibility level. If the versions do not match, two nodes cannot speak
to each other and they disconnect. As a result, this provides no forward or
backwards compatibility.
This patch implements a simple protocol negotiation at the dlm level by
introducing a major/minor version number scheme for entities that
communicate. Specifically, o2dlm has a major/minor version for interaction
with o2dlm on other nodes, and ocfs2 itself has a major/minor version for
interacting with the filesystem on other nodes.
This will allow rolling upgrades of ocfs2 clusters when changes to the
locking or network protocols can be done in a backwards compatible manner.
In those cases, only the minor number is changed and the negotatied protocol
minor is returned from dlm join. In the far less likely event that a
required protocol change makes backwards compatibility impossible, we simply
bump the major number.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
This adds a new dlmglue lock type which is intended to back flock()
requests.
Since these locks are driven from userspace, usage rules are much more
liberal than the typical Ocfs2 internal cluster lock. As a result, we can't
make use of most dlmglue features - lock caching and lock level
optimizations in particular. Additionally, userspace is free to deadlock
itself, so we have to deal with that in the same way as the rest of the
kernel - by allowing a signal to abort a lock request.
In order to keep ocfs2_cluster_lock() complexity down, ocfs2_file_lock()
does it's own dlm coordination. We still use the same helper functions
though, so duplicated code is kept to a minimum.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Call this the "inode_lock" now, since it covers both data and meta data.
This patch makes no functional changes.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
The meta lock now covers both meta data and data, so this just removes the
now-redundant data lock.
Combining locks saves us a round of lock mastery per inode and one less lock
to ping between nodes during read/write.
We don't lose much - since meta locks were always held before a data lock
(and at the same level) ordered writeout mode (the default) ensured that
flushing for the meta data lock also pushed out data anyways.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
In order to extend inode lock coverage to inode data, we use the same data
downconvert worker with only a small modification to only do work for
regular files.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
The node maps that are set/unset by these votes are no longer relevant, thus
we can remove the mount and umount votes. Since those are the last two
remaining votes, we can also remove the entire vote infrastructure.
The vote thread has been renamed to the downconvert thread, and the small
amount of functionality related to managing it has been moved into
fs/ocfs2/dlmglue.c. All references to votes have been removed or updated.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
If we have not yet created a cluster lock, ocfs2_cluster_lock() will
first create it at NLMODE, and then convert the lock to either PRMODE or
EXMODE (whichever is requested).
Change ocfs2_cluster_lock() to just create the lock at the initially
requested level. ocfs2_locking_ast() handles this case fine, so the only
update required was in setup of locking state. This should reduce the number
of network messages required for a new lock by one, providing an incremental
performance enhancement.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Add the disk, network and memory structures needed to support data in inode.
Struct ocfs2_inline_data is defined and embedded in ocfs2_dinode for storing
inline data.
A new inode field, i_dyn_features, is added to facilitate tracking of
dynamic inode state. Since it will be used often, we want to mirror it on
ocfs2_inode_info, and transfer it via the meta data lvb.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Reviewed-by: Joel Becker <joel.becker@oracle.com>
Remove includes of <linux/smp_lock.h> where it is not used/needed.
Suggested by Al Viro.
Builds cleanly on x86_64, i386, alpha, ia64, powerpc, sparc,
sparc64, and arm (all 59 defconfigs).
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch makes the following needlessly global functions static:
- aops.c: ocfs2_write_data_page()
- dlmglue.c: ocfs2_dump_meta_lvb_info()
- file.c: ocfs2_set_inode_size()
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
The extent map code was ripped out earlier because of an inability to deal
with holes. This patch adds back a simpler caching scheme requiring far less
code.
Our old extent map caching was designed back when meta data block caching in
Ocfs2 didn't work very well, resulting in many disk reads. These days our
metadata caching is much better, resulting in no un-necessary disk reads. As
a result, extent caching doesn't have to be as fancy, nor does it have to
cache as many extents. Keeping the last 3 extents seen should be sufficient
to give us a small performance boost on some streaming workloads.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Older file systems which didn't support holes did a dumb calculation of
i_blocks based on i_size. This is no longer accurate, so fix things up to
take actual allocation into account.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
The code in extent_map.c is not prepared to deal with a subtree being
rotated between lookups. This can happen when filling holes in sparse files.
Instead of a lengthy patch to update the code (which would likely lose the
benefit of caching subtree roots), we remove most of the algorithms and
implement a simple path based lookup. A less ambitious extent caching scheme
will be added in a later patch.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Ocfs2 currently does cluster-wide node messaging to check the open state of
an inode during delete. This patch removes that mechanism in favor of an
inode cluster lock which is taken at shared read when an inode is first read
and dropped in clear_inode(). This allows a deleting node to test the
liveness of an inode by attempting to take an exclusive lock.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
We don't want the extent map and uptodate cache destruction in
ocfs2_meta_lock_update() on a local mount, so skip that.
This fixes several bugs with uptodate being cleared on buffers and extent
maps being corrupted.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Mmap-heavy clustered workloads were sometimes finding stale data on mmap
reads. The solution is to call unmap_mapping_range() on any down convert of
a data lock.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
This allows users to format an ocfs2 file system with a special flag,
OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT. When the file system sees this flag, it
will not use any cluster services, nor will it require a cluster
configuration, thus acting like a 'local' file system.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
This patch adds the core routines for updating atime in ocfs2.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
This gets us rid of a slab we no longer need, as well as removing the
majority of what's left on ocfs2_journal_handle.
ocfs2_commit_unstarted_handle() has no more real work to do, so remove that
function too.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
This patch makes the needlessly global ocfs2_create_new_lock() static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
The following patches reduce the size of the VFS inode structure by 28 bytes
on a UP x86. (It would be more on an x86_64 system). This is a 10% reduction
in the inode size on a UP kernel that is configured in a production mode
(i.e., with no spinlock or other debugging functions enabled; if you want to
save memory taken up by in-core inodes, the first thing you should do is
disable the debugging options; they are responsible for a huge amount of bloat
in the VFS inode structure).
This patch:
The filesystem or device-specific pointer in the inode is inside a union,
which is pretty pointless given that all 30+ users of this field have been
using the void pointer. Get rid of the union and rename it to i_private, with
a comment to explain who is allowed to use the void pointer. This is just a
cleanup, but it allows us to reuse the union 'u' for something something where
the union will actually be used.
[judith@osdl.org: powerpc build fix]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Judith Lebzelter <judith@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With this, we don't need to pass an additional struct with function pointer.
Now that the callbacks are fully used, comment the remaining API.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Have ocfs2_process_blocked_lock() call ocfs2_generic_unblock_lock(), which
gets to be ocfs2_unblock_lock() now that it's the only possible unblock
function.
Remove the ->unblock() callback from the structure, and all lock type
specific unblock functions.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
The meta data unblocking code no longer needs ocfs2_do_unblock_meta() or
ocfs2_can_downconvert_meta_lock(), so remove them.
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
Fill in the ->check_downconvert and ->set_lvb callbacks with meta data
specific operations and switch ocfs2_unblock_meta() to call
ocfs2_generic_unblock_lock()
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>