[lvm-devel] [PATCH] clvmd: closedown the cluster after finishing of lvm_thread

Fri Nov 29 06:06:39 UTC 2013

On 2013-11-28 21:57, Zdenek Kabelac wrote:
> On 27.11.2013 09:56, dongmao zhang wrote:
>> If clvmd receives a SIGTERM while lvm_thread is processing a remote
>> request, it frees the cluster resources before the real work of
>> lvm_thread is done. If the cluster resources are freed before
>> send_message, the remote command hangs forever.
>>
>> This patch moves the closedown to after the worker thread has been joined.
>> ---
>> daemons/clvmd/clvmd.c | 3 ++-
>> 1 files changed, 2 insertions(+), 1 deletions(-)
>>
>> diff --git a/daemons/clvmd/clvmd.c b/daemons/clvmd/clvmd.c
>> index d57c0fd..b2f7dd5 100644
>> --- a/daemons/clvmd/clvmd.c
>> +++ b/daemons/clvmd/clvmd.c
>> @@ -621,6 +621,8 @@ int main(int argc, char *argv[])
>> if ((errno = pthread_join(lvm_thread, NULL)))
>> log_sys_error("pthread_join", "");
>>
>> + clops->cluster_closedown();
>> +
>> close_local_sock(local_sock);
>> destroy_lvm();
>>
>> @@ -979,7 +981,6 @@ static void main_loop(int local_sock, int cmd_timeout)
>> }
>>
>> closedown:
>> - clops->cluster_closedown();
>> if (quit)
>> DEBUGLOG("SIGTERM received\n");
>> }
>
>
> It's not clear to me how this code move helps anything.
>
> You just moved the call to clops->cluster_closedown() after joining the thread?
>
> In which code path does this patch change anything?
>
> Zdenek
>
>

Hi Zdenek,
thank you for your reply. The main idea is that lvm_thread_fn uses 
cluster resources (for example, cpg_handler in send_message), so we 
cannot free the cluster resources until lvm_thread_fn finishes.

The 'lvm_thread_fn' thread runs 'process_work_item', which sends the 
reply message (cluster_send_message) back to the remote nodes. 
cluster_send_message uses the cluster resources, so we cannot free them 
before lvm_thread_fn has really finished. With the current code, 
cluster_closedown in the main thread can run before the lvm_thread_fn 
thread calls send_message.

If that happens, sending the message fails and the remote node never 
gets the response; it has to wait for a timeout before it can finish.
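
To make the ordering problem concrete, here is a minimal standalone 
sketch. It is not clvmd code: fake_cluster, fake_send_message, 
fake_closedown, fake_worker, finish_work and run_shutdown are all 
hypothetical stand-ins, and the condition variable is only there to 
force the "worker is still busy when SIGTERM arrives" case 
deterministically.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the cluster handle used by send_message. */
struct fake_cluster {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool work_done;    /* set when the worker may send its reply */
	bool open;         /* false once closedown has run */
	int replies_sent;  /* replies that actually went out */
};

/* Stand-in for cluster_send_message(): fails once the cluster is closed. */
static int fake_send_message(struct fake_cluster *c)
{
	int rc = -1;

	pthread_mutex_lock(&c->lock);
	if (c->open) {
		c->replies_sent++;
		rc = 0;
	}
	pthread_mutex_unlock(&c->lock);
	return rc;
}

/* Stand-in for clops->cluster_closedown(). */
static void fake_closedown(struct fake_cluster *c)
{
	pthread_mutex_lock(&c->lock);
	c->open = false;
	pthread_mutex_unlock(&c->lock);
}

/* Lets the worker finish "processing" the remote request. */
static void finish_work(struct fake_cluster *c)
{
	pthread_mutex_lock(&c->lock);
	c->work_done = true;
	pthread_cond_signal(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/* Stand-in for lvm_thread_fn(): finish the work, then send the reply. */
static void *fake_worker(void *arg)
{
	struct fake_cluster *c = arg;

	assert(c != NULL);
	pthread_mutex_lock(&c->lock);
	while (!c->work_done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);

	(void) fake_send_message(c); /* fails if closedown already ran */
	return NULL;
}

/* Returns the number of replies sent for the chosen shutdown order. */
int run_shutdown(bool join_before_closedown)
{
	struct fake_cluster c;
	pthread_t worker;
	int sent;

	pthread_mutex_init(&c.lock, NULL);
	pthread_cond_init(&c.cond, NULL);
	c.work_done = false;
	c.open = true;
	c.replies_sent = 0;

	if (pthread_create(&worker, NULL, fake_worker, &c) != 0)
		return -1;

	if (join_before_closedown) {
		finish_work(&c);
		pthread_join(worker, NULL); /* patched order: join first */
		fake_closedown(&c);
	} else {
		fake_closedown(&c);         /* old order: closedown first */
		finish_work(&c);
		pthread_join(worker, NULL);
	}

	sent = c.replies_sent;
	pthread_cond_destroy(&c.cond);
	pthread_mutex_destroy(&c.lock);
	return sent;
}
```

In this model run_shutdown(true) returns 1 (the reply goes out) and 
run_shutdown(false) returns 0 (the reply is lost), which is the 
difference the patch is after.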

I hit a bug like this with two nodes and a VG resource:
1. NodeA runs 'rcopenais stop'
2. NodeB runs 'vgscan'

Sometimes vgscan hangs for a while, waiting for responses from all 
cluster nodes, because clvmd on NodeA cannot send its reply back: 
cluster_closedown happened before send_message.

Dongmao Zhang