There is a race window in dlm_get_lock_resource(), which may return a
lock resource which has been purged. This will cause the process to
hang forever in dlmlock() as the ast msg can't be handled due to its
lock resource not existing.
dlm_get_lock_resource {
...
spin_lock(&dlm->spinlock);
tmpres = __dlm_lookup_lockres_full(dlm, lockid, namelen, hash);
if (tmpres) {
spin_unlock(&dlm->spinlock);
>>>>>>>> race window, dlm_run_purge_list() may run and purge
the lock resource
spin_lock(&tmpres->spinlock);
...
spin_unlock(&tmpres->spinlock);
}
}
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A tiny race between BAST and unlock message causes the NULL dereference.
A node sends an unlock request to master and receives a response. Before
processing the response it receives a BAST from the master. Since both
requests are processed by different threads it creates a race. While the
BAST is being processed, lock can get freed by unlock code.
This patch makes bast to return immediately if lock is found but unlock is
pending. The code should handle this race. We also have to fix master
node to skip sending BAST after receiving unlock message.
Below is the crash stack
BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
IP: o2dlm_blocking_ast_wrapper+0xd/0x16
dlm_do_local_bast+0x8e/0x97 [ocfs2_dlm]
dlm_proxy_ast_handler+0x838/0x87e [ocfs2_dlm]
o2net_process_message+0x395/0x5b8 [ocfs2_nodemanager]
o2net_rx_until_empty+0x762/0x90d [ocfs2_nodemanager]
worker_thread+0x14d/0x1ed
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove dlm_joined() that is not used anywhere.
This was partially found by using a static code analysis program called
cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use snprintf format specifier "%lu" instead of "%ld" for argument of type
'unsigned long'.
Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the recovery master is down, the owner of $RECOVERY calls
dlm_do_local_recovery_cleanup() to prune any $RECOVERY entries for dead
nodes. The lock is in the granted list and the refcount must be 2. We
should put twice to remove this lock. Otherwise, it will lead to a memory
leak.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Reported-by: yangwenfang <vicky.yangwenfang@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In dlm_process_recovery_data, only when dlm_new_lock failed the ret will
be set to -ENOMEM. And in this case, newlock is definitely NULL. So
test newlock is meaningless, remove it.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit ac4fef4d23 ("ocfs2/dlm: do not purge lockres that is queued for
assert master") may have the following possible race case:
dlm_dispatch_assert_master dlm_wq
========================================================================
queue_work(dlm->quedlm_worker,
&dlm->dispatched_work);
dispatch work,
dlm_lockres_drop_inflight_worker
*BUG_ON(res->inflight_assert_workers == 0)*
dlm_lockres_grab_inflight_worker
inflight_assert_workers++
So ensure inflight_assert_workers to be increased first.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Xue jiufei <xuejiufei@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit 1faf289454 ("ocfs2_dlm: disallow a domain join if node maps
mismatch") we introduced a new earlier NULL check so this one is not
needed. Also static checkers complain because we dereference it first
and then check for NULL.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Node A sends master query request to node B which is the master. At this
time lockres happens to be on purgelist. dlm_master_request_handler gets
the dlm spinlock, finds the resource and releases the dlm spin lock.
Right at this dlm_thread on this node could purge the lockres.
dlm_master_request_handler can then acquire lockres spinlock and reply to
Node A that node B is the master even though lockres on node B is purged.
The above scenario will now make node A falsely think node B is the master
which is inconsistent. Further if another node C tries to master the same
resource, every node will respond they are not the master. Node C then
masters the resource and sends assert master to all nodes. This will now
make node A crash with the following message.
dlm_assert_master_handler:1831 ERROR: DIE! Mastery assert from 9, but current
owner is 10!
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Tested-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Do not BUG() if GFP_ATOMIC allocation fails in dlm_dispatch_assert_master.
Instead, return -ENOMEM to the sender and then retry.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following case may lead to o2net_wq and o2hb thread deadlock on
o2hb_callback_sem.
Currently there are 2 nodes say N1, N2 in the cluster. And N2 down, at
the same time, N3 tries to join the cluster. So N1 will handle node
down (N2) and join (N3) simultaneously.
o2hb o2net_wq
->o2hb_do_disk_heartbeat
->o2hb_check_slot
->o2hb_run_event_list
->o2hb_fire_callbacks
->down_write(&o2hb_callback_sem)
->o2net_hb_node_down_cb
->flush_workqueue(o2net_wq)
->o2net_process_message
->dlm_query_join_handler
->o2hb_check_node_heartbeating
->o2hb_fill_node_map
->down_read(&o2hb_callback_sem)
No need to take o2hb_callback_sem in dlm_query_join_handler,
o2hb_live_lock is enough to protect live node map.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: xMark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: jiangyiwen <jiangyiwen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reduce boilerplate code by using seq_open_private() instead of seq_open()
Signed-off-by: Rob Jones <rob.jones@codethink.co.uk>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the branch that free res->lockname.name because the condition
is never satisfied when jump to label error.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dlm_lockres_put() should be called without &res->spinlock, otherwise a
deadlock case may happen.
spin_lock(&res->spinlock)
...
dlm_lockres_put
->dlm_lockres_release
->dlm_print_one_lock_resource
->spin_lock(&res->spinlock)
Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Refactoring error handling in dlm_alloc_ctxt to simplify code.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In dlm_assert_master_handler, the mle is get in dlm_find_mle, should be
put when goto kill, otherwise, this mle will never be released.
Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a deadlock case which reported by Guozhonghua:
https://oss.oracle.com/pipermail/ocfs2-devel/2014-September/010079.html
This case is caused by &res->spinlock and &dlm->master_lock
misordering in different threads.
It was introduced by commit 8d400b81cc ("ocfs2/dlm: Clean up refmap
helpers"). Since lockres is new, it doesn't not require the
&res->spinlock. So remove it.
Fixes: 8d400b81cc ("ocfs2/dlm: Clean up refmap helpers")
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: joyce.xue <xuejiufei@huawei.com>
Reported-by: Guozhonghua <guozhonghua@h3c.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 19074140
When umount is issued during recovery on the new master that has not
finished remastering locks, it triggers BUG() in
dlm_send_mig_lockres_msg(). Here is the situation:
1) node A has a lock on resource X mastered by node B.
2) node B dies -> node A sets recovering flag for res X
3) Node C becomes the new master for resources owned by the
dead node and is remastering locks of the dead node but
has not finished the remastering process yet.
4) umount is issued on node C.
5) During processing of umount, ignoring unfished recovery,
node C attempts to migrate resource X to node A.
6) node A finds res X in DLM_LOCK_RES_RECOVERING state, considers
it a logic error and sends back -EFAULT.
7) node C asserts BUG() upon seeing EFAULT resp from node B.
Fix is to delay migrating res X till remastering is finished at which
point recovering flag will be cleared on both A and C.
Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The unit of total_backoff is msecs not jiffies, so no need to do the
conversion. Otherwise, the join timeout is not 90 sec.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When workqueue is delayed, it may occur that a lockres is purged while it
is still queued for master assert. it may trigger BUG() as follows.
N1 N2
dlm_get_lockres()
->dlm_do_master_requery
is the master of lockres,
so queue assert_master work
dlm_thread() start running
and purge the lockres
dlm_assert_master_worker()
send assert master message
to other nodes
receiving the assert_master
message, set master to N2
dlmlock_remote() send create_lock message to N2, but receive DLM_IVLOCKID,
if it is RECOVERY lockres, it triggers the BUG().
Another BUG() is triggered when N3 become the new master and send
assert_master to N1, N1 will trigger the BUG() because owner doesn't
match. So we should not purge lockres when it is queued for assert
master.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following case may lead to endless loop during umount.
node A node B node C node D
umount volume,
migrate lockres1
to B
want to lock lockres1,
send
MASTER_REQUEST_MSG
to C
init block mle
send
MIGRATE_REQUEST_MSG
to C
find a block
mle, and then
return
DLM_MIGRATE_RESPONSE_MASTERY_REF
to B
set C in refmap
umount successfully
try to umount, endless
loop occurs when migrate
lockres1 since C is in
refmap
So we can fix this endless loop case by only returning
DLM_MIGRATE_RESPONSE_MASTERY_REF if it has a mastery mle when receiving
MIGRATE_REQUEST_MSG.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: jiangyiwen <jiangyiwen@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Xue jiufei <xuejiufei@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a lockres in purge list but is still in use, it should be moved to
the tail of purge list. dlm_thread will continue to check next lockres in
purge list. However, code list_move_tail(&dlm->purge_list,
&lockres->purge) will do *no* movements, so dlm_thread will purge the same
lockres in this loop again and again. If it is in use for a long time,
other lockres will not be processed.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 18639535
Two node cluster and both nodes hold a lock at PR level and both want to
convert to EX at the same time. Master node 1 has sent BAST and then
closes the connection due to idletime out. Node 0 receives BAST, sends
unlock req with cancel flag but gets error -ENOTCONN. The problem is
this error is ignored in dlm_send_remote_unlock_request() on the
**incorrect** assumption that the master is dead. See NOTE in comment
why it returns DLM_NORMAL. Upon getting DLM_NORMAL, node 0 proceeds to
sends convert (without cancel flg) which fails with -ENOTCONN. waits 5
sec and resends.
This time gets DLM_IVLOCKID from the master since lock not found in
grant, it had been moved to converting queue in response to conv PR->EX
req. No way out.
Node 1 (master) Node 0
============== ======
lock mode PR PR
convert PR -> EX
mv grant -> convert and que BAST
...
<-------- convert PR -> EX
convert que looks like this: ((node 1, PR -> EX) (node 0, PR -> EX))
...
BAST (want PR -> NL)
------------------>
...
idle timout, conn closed
...
In response to BAST,
sends unlock with cancel convert flag
gets -ENOTCONN. Ignores and
sends remote convert request
gets -ENOTCONN, waits 5 Sec, retries
...
reconnects
<----------------- convert req goes through on next try
does not find lock on grant que
status DLM_IVLOCKID
------------------>
...
No way out. Fix is to keep retrying unlock with cancel flag until it
succeeds or the master dies.
Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dlm_recovery_ctxt.received is unused.
ocfs2_should_refresh_lock_res() can only return 0 or 1, so the error
handling code in ocfs2_super_lock() is unneeded.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We found a race situation when dlm recovery and node joining occurs
simultaneously if the network state is bad.
N1 N4
start joining dlm and send
query join to all live nodes
set joining node to N1, return OK
send query join to other
live nodes and it may take
a while
call dlm_send_join_assert()
to send assert join message
when N2 is down, so keep
trying to send message to N2
until find N2 is down
send assert join message to
N3, but connection is down
with N3, so it may take a
while
become the recovery master for N2
and send begin reco message to other
nodes in domain map but no N1
connection with N3 is rebuild,
then send assert join to N4
call dlm_assert_joined_handler(),
add N1 to domain_map
dlm recovery done, send finalize message
to nodes in domain map, including N1
receiving finalize message,
trigger the BUG() because
recovery master mismatch.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We found there is a conversion deadlock when the owner of lockres
happened to crash before send DLM_PROXY_AST_MSG for a downconverting
lock. The situation is as follows:
Node1 Node2 Node3
the owner of lockresA
lock_1 granted at EX mode
and call ocfs2_cluster_unlock
to decrease ex_holders.
converting lock_3 from
NL to EX
send DLM_PROXY_AST_MSG
to Node1, asking Node 1
to downconvert.
receiving DLM_PROXY_AST_MSG,
thread ocfs2dc send
DLM_CONVERT_LOCK_MSG
to Node2 to downconvert
lock_1(EX->NL).
lock_1 can be granted and
put it into pending_asts
list, return DLM_NORMAL.
then something happened
and Node2 crashed.
received DLM_NORMAL, waiting
for DLM_PROXY_AST_MSG.
selected as the recovery
master, receving migrate
lock from Node1, queue
lock_1 to the tail of
converting list.
After dlm recovery, converting list in the master of lockresA(Node3)
will be: converting list head <-> lock_3(NL->EX) <->lock_1(EX<->NL).
Requested mode of lock_3 is not compatible with the granted mode of
lock_1, so it can not be granted. and lock_1 can not downconvert
because covnerting queue is strictly FIFO. So a deadlock is created.
We think function dlm_process_recovery_data() should queue_ast for
lock_1 or alter the order of lock_1 and lock_3, so dlm_thread can
process lock_1 first. And if there are multiple downconverting locks,
they must convert form PR to NL, so no need to sort them.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Static values are automatically initialized to NULL.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In dlm_init, if create dlm_lockname_cache failed in
dlm_init_master_caches, it will destroy dlm_lockres_cache which created
before twice. And this will cause system die when loading modules.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In dlm_query_region_handler(), once kmalloc failed, it will unlock
dlm_domain_lock without lock first, then deadlock happens.
Signed-off-by: Zhonghua Guo <guozhonghua@h3c.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Tested-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a race window in dlm_do_recovery() between dlm_remaster_locks()
and dlm_reset_recovery() when the recovery master nearly finish the
recovery process for a dead node. After the master sends FINALIZE_RECO
message in dlm_remaster_locks(), another node may become the recovery
master for another dead node, and then send the BEGIN_RECO message to
all the nodes included the old master, in the handler of this message
dlm_begin_reco_handler() of old master, dlm->reco.dead_node and
dlm->reco.new_master will be set to the second dead node and the new
master, then in dlm_reset_recovery(), these two variables will be reset
to default value. This will cause new recovery master can not finish
the recovery process and hung, at last the whole cluster will hung for
recovery.
old recovery master: new recovery master:
dlm_remaster_locks()
become recovery master for
another dead node.
dlm_send_begin_reco_message()
dlm_begin_reco_handler()
{
if (dlm->reco.state & DLM_RECO_STATE_FINALIZE) {
return -EAGAIN;
}
dlm_set_reco_master(dlm, br->node_idx);
dlm_set_reco_dead_node(dlm, br->dead_node);
}
dlm_reset_recovery()
{
dlm_set_reco_dead_node(dlm, O2NM_INVALID_NODE_NUM);
dlm_set_reco_master(dlm, O2NM_INVALID_NODE_NUM);
}
will hang in dlm_remaster_locks() for
request dlm locks info
Before send FINALIZE_RECO message, recovery master should set
DLM_RECO_STATE_FINALIZE for itself and clear it after the recovery done,
this can break the race windows as the BEGIN_RECO messages will not be
handled before DLM_RECO_STATE_FINALIZE flag is cleared.
A similar race may happen between new recovery master and normal node
which is in dlm_finalize_reco_handler(), also fix it.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This issue was introduced by commit 800deef3f6 ("ocfs2: use
list_for_each_entry where benefical") in 2007 where it replaced
list_for_each with list_for_each_entry. The variable "lock" will point
to invalid data if "tmpq" list is empty and a panic will be triggered
due to this. Sunil advised reverting it back, but the old version was
also not right. At the end of the outer for loop, that
list_for_each_entry will also set "lock" to an invalid data, then in the
next loop, if the "tmpq" list is empty, "lock" will be an stale invalid
data and cause the panic. So reverting the list_for_each back and reset
"lock" to NULL to fix this issue.
Another concern is that this seemes can not happen because the "tmpq"
list should not be empty. Let me describe how.
old lock resource owner(node 1): migratation target(node 2):
image there's lockres with a EX lock from node 2 in
granted list, a NR lock from node x with convert_type
EX in converting list.
dlm_empty_lockres() {
dlm_pick_migration_target() {
pick node 2 as target as its lock is the first one
in granted list.
}
dlm_migrate_lockres() {
dlm_mark_lockres_migrating() {
res->state |= DLM_LOCK_RES_BLOCK_DIRTY;
wait_event(dlm->ast_wq, !dlm_lockres_is_dirty(dlm, res));
//after the above code, we can not dirty lockres any more,
// so dlm_thread shuffle list will not run
downconvert lock from EX to NR
upconvert lock from NR to EX
<<< migration may schedule out here, then
<<< node 2 send down convert request to convert type from EX to
<<< NR, then send up convert request to convert type from NR to
<<< EX, at this time, lockres granted list is empty, and two locks
<<< in the converting list, node x up convert lock followed by
<<< node 2 up convert lock.
// will set lockres RES_MIGRATING flag, the following
// lock/unlock can not run
dlm_lockres_release_ast(dlm, res);
}
dlm_send_one_lockres()
dlm_process_recovery_data()
for (i=0; i<mres->num_locks; i++)
if (ml->node == dlm->node_num)
for (j = DLM_GRANTED_LIST; j <= DLM_BLOCKED_LIST; j++) {
list_for_each_entry(lock, tmpq, list)
if (lock) break; <<< lock is invalid as grant list is empty.
}
if (lock->ml.node != ml->node)
BUG() >>> crash here
}
I see the above locks status from a vmcore of our internal bug.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When mounting an ocfs2 volume, it will firstly generate a file
/sys/kernel/debug/o2dlm/<uuid>/dlm_state, and then launch the dlm thread.
So the following situation will cause a null pointer dereference.
dlm_debug_init -> access file dlm_state which will call dlm_state_print ->
dlm_launch_thread
Move dlm_debug_init after dlm_launch_thread and dlm_launch_recovery_thread
can fix this issue.
Signed-off-by: Zongxun Wang <wangzongxun@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The versioning information is confusing for end-users. The numbers are
stuck at 1.5.0 when the tools version have moved to 1.8.2. Remove the
versioning system in the OCFS2 modules and let the kernel version be the
guide to debug issues.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We trigger a bug in __dlm_lockres_reserve_ast() when we parallel umount 4
nodes. The situation is as follows:
1) Node A migrate all lockres it owned(eg. lockres A) to other nodes
say node B when it umounts.
2) Receiving MIG_LOCKRES message from A, Node B masters the lockres A
with DLM_LOCK_RES_MIGRATING state set.
3) Then we umount ocfs2 on node B. It also should migrate lockres A to
another node, say node C. But now, DLM_LOCK_RES_MIGRATING state of
lockers A is not cleared. Node B triggered the BUG on lockres with
state DLM_LOCK_RES_MIGRATING.
Signed-off-by: Xuejiufei <xuejiufei@huawei.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Tariq Saeed <tariq.x.saeed@oracle.com>
Cc: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A parallel umount on 4 nodes triggered a bug in
dlm_process_recovery_date(). Here's the situation:
Receiving MIG_LOCKRES message, A node processes the locks in migratable
lockres. It copys lvb from migratable lockres when processing the first
valid lock.
If there is a lock in the blocked list with the EX level, it triggers the
BUG. Since valid lvbs are set when locks are granted with EX or PR
levels, locks in the blocked list cannot have valid lvbs. Therefore I
think we should skip the locks in the blocked list.
Signed-off-by: Xuejiufei <xuejiufei@huawei.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dlm_do_local_recovery_cleanup() should force clean refmap if the owner of
lockres is UNKNOWN. Otherwise node may hang when umounting filesystems.
Here's the situation:
Node1 Node2
dlmlock()
-> dlm_get_lock_resource()
send DLM_MASTER_REQUEST_MSG to
other nodes.
trying to master this lockres,
return MAYBE.
selected as the master of lockresA,
set mle->master to Node1,
and do assert_master,
send DLM_ASSERT_MASTER_MSG to Node2.
Node 2 has interest on lockresA
and return
DLM_ASSERT_RESPONSE_MASTERY_REF
then something happened and
Node2 crashed.
Receiving DLM_ASSERT_RESPONSE_MASTERY_REF, set Node2 into refmap, and keep
sending DLM_ASSERT_MASTER_MSG to other nodes
o2hb found node2 down, calling dlm_hb_node_down() -->
dlm_do_local_recovery_cleanup() the master of lockresA is still UNKNOWN,
no need to call dlm_free_dead_locks().
Set the master of lockresA to Node1, but Node2 stills remains in refmap.
When Node1 umount, it found that the refmap of lockresA is not empty and
attempted to migrate it to Node2, But Node2 is already down, so umount
hang, trying to migrate lockresA again and again.
Signed-off-by: joyce <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[dan.carpenter@oracle.com: fix up some NULL dereference bugs]
Signed-off-by: Dong Fang <yp.fangdong@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dlm_request_all_locks() should deal with the status sent from target node
if DLM_LOCK_REQUEST_MSG is sent successfully, or recovery master will fall
into endless loop, waiting for other nodes to send locks and
DLM_RECO_DATA_DONE_MSG to me.
NodeA NodeB
selected as recovery master
dlm_remaster_locks()
->dlm_request_all_locks()
send DLM_LOCK_REQUEST_MSG to nodeA
It happened that NodeA cannot alloc memory when it processes this
message. dlm_request_all_locks_handler() do not queue
dlm_request_all_locks_worker and returns -ENOMEM. It will never send
locks and DLM_RECO_DATA_DONE_MSG to NodeB.
NodeB do not deal with the status
sent from nodeA, and will fall in
endless loop waiting for the
recovery state of NodeA to be
changed.
Signed-off-by: joyce <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Jeff Liu <jeff.liu@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dlmlock_master() returns DLM_RECOVERING/DLM_MIGRATING/ DLM_FORWAR after
adding lock to blocked list if lockres has the state
DLM_LOCK_RES_RECOVERING/DLM_LOCK_RES_MIGRATING/ DLM_LOCK_RES_IN_PROGRESS.
so it will retry in dlmlock(). And this may cause dlm_thread fall into an
infinite loop
Thread1 dlm_thread
calls dlm_lock->dlmlock_master,
if lockresA is in state
DLM_LOCK_RES_RECOVERING, calls
__dlm_wait_on_lockres() and waits
until others threads clear this
state;
If cannot grant this lock,
adding lock to blocked list,
and return DLM_RECOVERING;
Grant this lock and move it to
grant list;
After a while, retry and
calls list_add_tail(), adding lock
to blocked list again.
Granted and blocked list of this lockres will become the following
conditions:
lock_res->granted.next = dlm_lock->list_head;
lock_res->blocked.next = dlm_lock->list_head;
dlm_lock->list_head.next = dlm_lock_resource->blocked;
When dlm_thread traverses the granted list, it will fall into an endless
loop, checking dlm_lock.list_head, dlm_lock->list_head.next
(i.e.lock_res->blocked), lock_res->blocked.next(i.e.dlm_lock.list_head
again) .....
Signed-off-by: joyce <xuejiufei@huawei.com>
Reviewed-by: jensen <shencanquan@huawei.com>
Cc: Jeff Liu <jeff.liu@oracle.com>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dlm_begin_reco_handler() returns without putting dlm when dlm recovery
state is DLM_RECO_STATE_FINALIZE.
Signed-off-by: joyce <xuejiufei@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In dlm_request_all_locks, ret is type enum. But o2net_send_message
returns a type int value. Then it will never run into the following
error branch. So we should change the ret type from enum to int.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Below 3 functions have already been declared in dlmcommon.h, so we have
no need to declare them again in dlmrecovery.c:
dlm_complete_recovery_thread
dlm_launch_recovery_thread
dlm_kick_recovery_thread
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Sunil Mushran <sunil.mushran@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dlm_mig_lockres_handler() is missing a dlm_lockres_put() on an error path.
Signed-off-by: joyce <xuejiufei@huawei.com>
Reviewed-by: shencanquan <shencanquan@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kfree on a NULL pointer is a no-op. Remove the redundant null pointer
check.
Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Acked-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs pile (part one) from Al Viro:
"Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
locking violations, etc.
The most visible changes here are death of FS_REVAL_DOT (replaced with
"has ->d_weak_revalidate()") and a new helper getting from struct file
to inode. Some bits of preparation to xattr method interface changes.
Misc patches by various people sent this cycle *and* ocfs2 fixes from
several cycles ago that should've been upstream right then.
PS: the next vfs pile will be xattr stuff."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
saner proc_get_inode() calling conventions
proc: avoid extra pde_put() in proc_fill_super()
fs: change return values from -EACCES to -EPERM
fs/exec.c: make bprm_mm_init() static
ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
ocfs2: fix possible use-after-free with AIO
ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
target: writev() on single-element vector is pointless
export kernel_write(), convert open-coded instances
fs: encode_fh: return FILEID_INVALID if invalid fid_type
kill f_vfsmnt
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
nfsd: handle vfs_getattr errors in acl protocol
switch vfs_getattr() to struct path
default SET_PERSONALITY() in linux/elf.h
ceph: prepopulate inodes only when request is aborted
d_hash_and_lookup(): export, switch open-coded instances
9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
9p: split dropping the acls from v9fs_set_create_acl()
...
My static checker complains that this is called with a spin_lock held
in dlm_master_requery_handler() from dlmrecovery.c. Probably the reason
we have not received any bug reports about this is that recovery is not
a common operation.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (31 commits)
ocfs2: avoid unaligned access to dqc_bitmap
ocfs2: Use filemap_write_and_wait() instead of write_inode_now()
ocfs2: honor O_(D)SYNC flag in fallocate
ocfs2: Add a missing journal credit in ocfs2_link_credits() -v2
ocfs2: send correct UUID to cleancache initialization
ocfs2: Commit transactions in error cases -v2
ocfs2: make direntry invalid when deleting it
fs/ocfs2/dlm/dlmlock.c: free kmem_cache_zalloc'd data using kmem_cache_free
ocfs2: Avoid livelock in ocfs2_readpage()
ocfs2: serialize unaligned aio
ocfs2: Implement llseek()
ocfs2: Fix ocfs2_page_mkwrite()
ocfs2: Add comment about orphan scanning
ocfs2: Clean up messages in the fs
ocfs2/cluster: Cluster up now includes network connections too
ocfs2/cluster: Add new function o2net_fill_node_map()
ocfs2/cluster: Fix output in file elapsed_time_in_ms
ocfs2/dlm: dlmlock_remote() needs to account for remastery
ocfs2/dlm: Take inflight reference count for remotely mastered resources too
ocfs2/dlm: Cleanup dlm_wait_for_node_death() and dlm_wait_for_node_recovery()
...
Memory allocated using kmem_cache_zalloc should be freed using
kmem_cache_free, not kfree.
The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression x,e,e1,e2;
@@
x = kmem_cache_zalloc(e1,e2)
... when != x = e
?-kfree(x)
+kmem_cache_free(e1,x)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
These files were getting <linux/module.h> via an implicit include
path, but we want to crush those out of existence since they cost
time during compiles of processing thousands of lines of headers
for no reason. Give them the lightweight header that just contains
the EXPORT_SYMBOL infrastructure.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
The cluster up check only checks to see if the node is heartbeating or not.
If yes it continues assuming that the node is connected to all the nodes. But
if that is not the case, the cluster join aborts with a stack of errors that
are not easy to comprehend.
This patch adds the network connect check upfront and prints the nodes that
the node is not yet connected to, before aborting.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
In dlmlock_remote(), we wait for the resource to stop being active before
setting the inprogress flag. Active includes recovery, migration, etc.
The problem here is that if the resource was being recovered or migrated, the
new owner could very well be that node itself (and thus not a remote node).
This problem was observed in Oracle bug#12583620. The error messages observed
were as follows:
dlm_send_remote_lock_request:337 ERROR: Error -40 (ELOOP) when sending message 503 (key 0xd6d8c7) to node 2
dlmlock_remote:271 ERROR: dlm status = DLM_BADARGS
dlmlock:751 ERROR: dlm status = DLM_BADARGS
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
The inflight reference count, in the lock resource, is taken to pin the resource
in memory. We take it when a new resource is created and release it after a
lock is attached to it. We do this to prevent the resource from getting purged
prematurely.
Earlier this reference count was being taken for locally mastered resources
only. This patch extends the same functionality for remotely mastered ones.
We are doing this because the same premature purging could occur for remotely
mastered resources if the remote node were to die before completion of the
create lock.
Fix for Oracle bug#12405575.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
During dlm domain shutdown, o2dlm has to free all the lock resources. Ones that
have no locks and references are freed. Ones that have locks and/or references
are migrated to another node.
The first task in migration is finding a target. Currently we scan the lock
resource and find one node that either has a lock or a reference. This is not
very efficient in a parallel umount case as we might end up migrating the
lock resource to a node which itself may have to migrate it to a third node.
The patch scans the dlm->exit_domain_map to ensure the target node is not
leaving the domain. If no valid target node is found, o2dlm does not migrate
the resource but instead waits for the unlock and deref messages that will
allow it to free the resource.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
This patch adds a new dlm message DLM_BEGIN_EXIT_DOMAIN_MSG and ups the dlm
protocol to 1.2.
o2dlm sends this new message in dlm_unregister_domain() to mark the beginning
of the exit domain. This message is sent to all nodes in the domain.
Currently o2dlm has no way of informing other nodes of its impending exit.
This information is useful as the other nodes could disregard the exiting
node in certain operations. For example, in resource migration. If two or
more nodes were umounting in parallel, it would be more efficient if o2dlm
were to choose a non-exiting node to be the new master node rather than an
exiting one.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Patch cleans up the gunk added by commit 388c4bcb4e.
dlm_is_lockres_migrateable() now returns 1 if lockresource is deemed
migrateable and 0 if not.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
During resource migration, if the target node were to die, the thread doing
the migration spins until the target node is not removed from the domain map.
This patch slows the spin by making the thread wait for the recovery to kick in.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Patch fixes a bug in the o2dlm protocol negotiation in that it is using
the builtin version rather than the negotiated version during the domain
join. This causes join errors when a node having kernel >= 2.6.37 joins
a cluster with nodes having kernels < 2.6.37.
This only affects the o2cb cluster stack.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Reported-by: Jacek Stepniewski <Jacek.Stepniewski@agora.pl>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Replace EXTRA_CFLAGS with ccflags-y. And change ntfs-objs to ntfs-y
for cleaner conditional inclusion.
Signed-off-by: matt mooney <mfm@muteddisk.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Michal Marek <mmarek@suse.cz>
In dlm_query_region_handler(), move the kmalloc outside the spinlock.
This allows us to use GFP_KERNEL instead of GFP_ATOMIC.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
mlog_exit is used to record the exit status of a function.
But because it is added in so many functions, if we enable it,
the system logs get filled up quickly and cause too much I/O.
So actually no one can open it for a production system or even
for a test.
This patch just try to remove it or change it. So:
1. if all the error paths already use mlog_errno, it is just removed.
Otherwise, it will be replaced by mlog_errno.
2. if it is used to print some return value, it is replaced with
mlog(0,...).
mlog_exit_ptr is changed to mlog(0.
All those mlog(0,...) will be replaced with trace events later.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
ENTRY is used to record the entry of a function.
But because it is added in so many functions, if we enable it,
the system logs get filled up quickly and cause too much I/O.
So actually no one can open it for a production system or even
for a test.
So for mlog_entry_void, we just remove it.
for mlog_entry(...), we replace it with mlog(0,...), and they
will be replace by trace event later.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (22 commits)
MAINTAINERS: Update Joel Becker's email address
ocfs2: Remove unused truncate function from alloc.c
ocfs2/cluster: dereferencing before checking in nst_seq_show()
ocfs2: fix build for OCFS2_FS_STATS not enabled
ocfs2/cluster: Show o2net timing statistics
ocfs2/cluster: Track process message timing stats for each socket
ocfs2/cluster: Track send message timing stats for each socket
ocfs2/cluster: Use ktime instead of timeval in struct o2net_sock_container
ocfs2/cluster: Replace timeval with ktime in struct o2net_send_tracking
ocfs2: Add DEBUG_FS dependency
ocfs2/dlm: Hard code the values for enums
ocfs2/dlm: Minor cleanup
ocfs2/dlm: Cleanup dlmdebug.c
ocfs2: Release buffer_head in case of error in ocfs2_double_lock.
ocfs2/cluster: Pin the local node when o2hb thread starts
ocfs2/cluster: Show pin state for each o2hb region
ocfs2/cluster: Pin/unpin o2hb regions
ocfs2/cluster: Remove dropped region from o2hb quorum region bitmap
ocfs2/cluster: Pin the remote node item in configfs
ocfs2/dlm: make existing convertion precedent over new lock
...
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
ocfs2: Fix system inodes cache overflow.
ocfs2: Hold ip_lock when set/clear flags for indexed dir.
ocfs2: Adjust masklog flag values
Ocfs2: Teach 'coherency=full' O_DIRECT writes to correctly up_read i_alloc_sem.
ocfs2/dlm: Migrate lockres with no locks if it has a reference
In o2dlm, the enumerated message values are part of the protocol.
The patch hard codes each value so as to reduce the chance of an editing
error causing a protocol mismatch.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Patch makes use of task_pid_nr(). Also removes the null check before calling
debugfs_remove().
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This patch adds support for pinning o2hb regions in configfs. Pinning disallows
a region to be cleanly stopped as long as it has an active dependent user
(read o2dlm).
In local heartbeat mode, the region uuid matching the domain name is pinned as
long as the o2dlm domain is active.
In global heartbeat mode, all regions are pinned as long as there is atleast
one dependent user and the region count is 3 or less. All regions are unpinned
if the number of dependent users is zero or region count is greater than 3.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Make existing convertion precedent over new lock. It makes o2dlm locking more
like fair locking.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Add the domain name and the resource name in the mlogs.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
o2dlm was not migrating resources with zero locks because it assumed that that
resource would get purged by dlm_thread. However, some usage patterns involve
creating and dropping locks at a high rate leading to the migrate thread seeing
zero locks but the purge thread seeing an active reference. When this happens,
the dlm_thread cannot purge the resource and the migrate thread sees no reason
to migrate that resource. The spell is broken when the migrate thread catches
the resource with a lock.
The fix is to make the migrate thread also consider the reference map.
This usage pattern can be triggered by userspace on userdlm locks and flocks.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
coccinelle check scripts/coccinelle/locks/call_kern.cocci found that
in fs/ocfs2/dlm/dlmdomain.c an allocation with GFP_KERNEL is done
with locks held:
dlm_query_region_handler
spin_lock(dlm_domain_lock)
dlm_match_regions
kmalloc(GFP_KERNEL)
Change it to GFP_ATOMIC.
Signed-off-by: David Sterba <dsterba@suse.cz>
CC: Joel Becker <joel.becker@oracle.com>
CC: Mark Fasheh <mfasheh@suse.com>
CC: ocfs2-devel@oss.oracle.com
--
Exists in v2.6.37-rc1 and current linux-next.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
dlm protocol 1.1. activates messages DLM_QUERY_REGION and DLM_QUERY_NODEINFO
that are a must for global heartbeat.
It also activates o2hb_global_heartbeat_active().
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Adds new dlm message DLM_QUERY_NODEINFO that sends the attributes of all
registered nodes. This message is sent if the negotiated dlm protocol is
1.1 or higher. If the information of the joining node does not match
that of any existing nodes, the join domain request is rejected.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Adds new dlm message DLM_QUERY_REGION that sends the names of all active
heartbeat regions. This message is only sent in the global heartbeat
mode. If the regions in the joining node do not fully match the ones in
the active nodes, the join domain request is rejected.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
While umounting, a block mle doesn't get freed if dlm is shutdown after
master request is received but before assert master. This results in unclean
shutdown of dlm domain.
This patch frees all mles that lie around after other nodes were notified about
exiting the dlm and marking dlm state as leaving. Only block mles are expected
to be around, so we log ERROR for other mles but still free them.
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This patch tries to handle the case in which list 'dlm->tracking_list' is
empty, to avoid accessing an invalid pointer. It fixes the following oops:
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1287
Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
O2net: Disallow o2net accept connection request from itself.
ocfs2/dlm: remove potential deadlock -V3
ocfs2/dlm: avoid incorrect bit set in refmap on recovery master
Fix the nested PR lock calling issue in ACL
ocfs2: Count more refcount records in file system fragmentation.
ocfs2 fix o2dlm dlm run purgelist (rev 3)
ocfs2/dlm: fix a dead lock
ocfs2: do not overwrite error codes in ocfs2_init_acl
When we need to take both dlm_domain_lock and dlm->spinlock, we should take
them in order of: dlm_domain_lock then dlm->spinlock.
There is pathes disobey this order. That is calling dlm_lockres_put() with
dlm->spinlock held in dlm_run_purge_list. dlm_lockres_put() calls dlm_put() at
the ref and dlm_put() locks on dlm_domain_lock.
Fix:
Don't grab/put the dlm when the initialising/releasing lockres.
That grab is not required because we don't call dlm_unregister_domain()
based on refcount.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In the following situation, there remains an incorrect bit in refmap on the
recovery master. Finally the recovery master will fail at purging the lockres
due to the incorrect bit in refmap.
1) node A has no interest on lockres A any longer, so it is purging it.
2) the owner of lockres A is node B, so node A is sending de-ref message
to node B.
3) at this time, node B crashed. node C becomes the recovery master. it recovers
lockres A(because the master is the dead node B).
4) node A migrated lockres A to node C with a refbit there.
5) node A failed to send de-ref message to node B because it crashed. The failure
is ignored. no other action is done for lockres A any more.
For mormal, re-send the deref message to it to recovery master can fix it. Well,
ignoring the failure of deref to the original master and not recovering the lockres
to recovery master has the same effect. And the later is simpler.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Acked-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This patch fixes two problems in dlm_run_purgelist
1. If a lockres is found to be in use, dlm_run_purgelist keeps trying to purge
the same lockres instead of trying the next lockres.
2. When a lockres is found unused, dlm_run_purgelist releases lockres spinlock
before setting DLM_LOCK_RES_DROPPING_REF and calls dlm_purge_lockres.
spinlock is reacquired but in this window lockres can get reused. This leads
to BUG.
This patch modifies dlm_run_purgelist to skip lockres if it's in use and purge
next lockres. It also sets DLM_LOCK_RES_DROPPING_REF before releasing the
lockres spinlock protecting it from getting reused.
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Acked-by: Sunil Mushran <sunil.mushran@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Joel Becker <joel.becker@oracle.com>
When we have to take both dlm->master_lock and lockres->spinlock,
take them in order
lockres->spinlock and then dlm->master_lock.
The patch fixes a violation of the rule.
We can simply move taking dlm->master_lock to where we have dropped res->spinlock
since when we access res->state and free mle memory we don't need master_lock's
protection.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Joel Becker <joel.becker@oracle.com>
For migration, we are waiting for DLM_LOCK_RES_MIGRATING flag to be set
before sending DLM_MIG_LOCKRES_MSG message to the target. We are using
dlm_migration_can_proceed() for that purpose. However, if the node is
down, dlm_migration_can_proceed() will also return "go ahead". In this
rare case, the DLM_LOCK_RES_MIGRATING flag might not be set yet. Remove
the BUG_ON() that trips over this condition.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This function is only called from one place and it's like this:
dlm_register_domain(conn->cc_name, dlm_key, &fs_version);
The "conn->cc_name" is 64 characters long. If strlen(conn->cc_name)
were equal to O2NM_MAX_NAME_LEN (64) that would be a bug because
strlen() doesn't count the NULL character.
In fact, if you look how O2NM_MAX_NAME_LEN is used, it mostly describes
64 character buffers. The only exception is nd_name from struct
o2nm_node.
Anyway I looked into it and in this case the domain string comes from
osb->uuid_str in ocfs2_setup_osb_uuid(). That's 32 characters and NULL
which easily fits into O2NM_MAX_NAME_LEN. This patch doesn't change how
the code works, but I think it makes the code a little cleaner.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>