[Linux-cluster] GFS locking issues

Anton Kornev akornev at gmail.com
Fri Jun 23 00:19:52 UTC 2006


David,

Thanks a lot for your comments.
Actually, that sounds rather strange to me.

I grepped /var/log/messages for "gnbd" and found that there are similar
messages even on a working host with no GFS problems.

bash-3.00# grep gnbd /var/log/messages
Jun 19 08:16:20 node1 kernel: gnbd (pid 25756: alogc.pl) got signal 9
Jun 19 08:16:20 node1 kernel: gnbd0: Send control failed (result -4)
Jun 19 08:16:20 node1 kernel: gnbd (pid 25756: alogc.pl) got signal 15
Jun 19 08:16:20 node1 kernel: gnbd0: Send control failed (result -4)
Jun 19 09:51:59 node1 kernel: gnbd (pid 26259: find) got signal 9
Jun 19 09:51:59 node1 kernel: gnbd0: Send control failed (result -4)
Jun 19 10:06:39 node1 kernel: gnbd (pid 313: alogc.pl) got signal 9
Jun 19 10:06:39 node1 kernel: gnbd0: Send control failed (result -4)
Jun 19 10:06:39 node1 kernel: gnbd (pid 313: alogc.pl) got signal 15
Jun 19 10:06:39 node1 kernel: gnbd0: Send control failed (result -4)
Jun 19 12:51:12 node1 kernel: gnbd (pid 19463: vi) got signal 1
Jun 19 12:51:12 node1 kernel: gnbd0: Send control failed (result -4)
Jun 20 14:48:16 node1 kernel: gnbd (pid 20238: alogc.pl) got signal 9
Jun 20 14:48:16 node1 kernel: gnbd0: Send control failed (result -4)
Jun 20 14:48:16 node1 kernel: gnbd (pid 20238: alogc.pl) got signal 15
Jun 20 14:48:16 node1 kernel: gnbd0: Send control failed (result -4)


I checked the gnbd-kernel sources (the latest available SRPM, not the CVS
version) and found that the first message ("gnbd ... got signal") is produced
by the sock_xmit() function, in this piece of code:

                if (signal_pending(current)) {
                        siginfo_t info;
                        spin_lock_irqsave(&current->sighand->siglock, flags);
                        printk(KERN_WARNING "gnbd (pid %d: %s) got signal %d\n",
                                current->pid, current->comm,
                                dequeue_signal(current, &current->blocked, &info));
                        spin_unlock_irqrestore(&current->sighand->siglock, flags);
                        result = -EINTR;
                        break;
                }

And the second message is generated inside gnbd_send_req() by this code:

        result = sock_xmit(sock, 1, &request, sizeof(request),
                        (gnbd_cmd(req) == GNBD_CMD_WRITE)? MSG_MORE: 0);
        if (result < 0) {
                printk(KERN_ERR "%s: Send control failed (result %d)\n",
                                dev->disk->disk_name, result);
                goto error_out;
        }


So at first glance these look like normal messages from gnbd: if a signal is
received during sock_xmit(), do not send anything and return -EINTR.
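For reference, the numbers in those log lines can be decoded with a few lines of userspace C. This is only a sketch: signal_name() and explain_result() are hypothetical helper names, and the numeric values (1, 9, 15 for the signals, 4 for EINTR) are what x86 Linux uses, so they may differ elsewhere.

```c
#include <errno.h>
#include <signal.h>
#include <string.h>

/* Hypothetical helper: name the three signals that appear in the log.
 * On x86 Linux SIGHUP is 1, SIGKILL is 9 and SIGTERM is 15. */
const char *signal_name(int signo)
{
    switch (signo) {
    case SIGHUP:  return "SIGHUP";
    case SIGKILL: return "SIGKILL";
    case SIGTERM: return "SIGTERM";
    default:      return "other";
    }
}

/* Hypothetical helper: turn a negative kernel-style result such as -4
 * into the matching errno description (-4 is -EINTR on Linux). */
const char *explain_result(int result)
{
    return strerror(-result);
}
```

So "got signal 9" followed by "result -4" reads as SIGKILL arriving and the send returning -EINTR.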

I am not sure whether it is a problem, but I took a look at the sock_xmit()
code and there are at least two things that seem strange to me.

1. There is an inconsistency between the comment and the code:

        /* Allow interception of SIGKILL only
         * Don't allow other signals to interrupt the transmission */
        spin_lock_irqsave(&current->sighand->siglock, flags);
        oldset = current->blocked;
        sigfillset(&current->blocked);
        sigdelsetmask(&current->blocked, sigmask(SIGKILL) | sigmask(SIGTERM) |
                      sigmask(SIGHUP));
        recalc_sigpending();
        spin_unlock_irqrestore(&current->sighand->siglock, flags);

So the comment suggests that only SIGKILL can interrupt the transmission, but
the real mask lets KILL/TERM/HUP signals through (btw: in my case it is
SIGTERM that locks everything up).

2. There are two blocks of code following each other:

                if (send)
                        result = sock_sendmsg(sock, &msg, size);
                else
                        result = sock_recvmsg(sock, &msg, size, 0);


                if (signal_pending(current)) {
                        siginfo_t info;
                        spin_lock_irqsave(&current->sighand->siglock, flags);
                        printk(KERN_WARNING "gnbd (pid %d: %s) got signal %d\n",
                                current->pid, current->comm,
                                dequeue_signal(current, &current->blocked, &info));
                        spin_unlock_irqrestore(&current->sighand->siglock, flags);
                        result = -EINTR;
                        break;
                }

Why do we need to return -EINTR if we have already done the real
sock_sendmsg() / sock_recvmsg()?  What if the real transmission was okay and
the real result is not an error?
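The question can be modelled in plain C by comparing two policies. This is a simplified sketch, not a proposed patch: both function names are hypothetical, and it ignores the fact that sock_xmit() loops until the whole buffer is transferred.

```c
#include <errno.h>

/* Policy in the quoted code: a pending signal always overrides the
 * result of the sock_sendmsg()/sock_recvmsg() call that just ran. */
int result_as_quoted(int io_result, int signal_pending)
{
    if (signal_pending)
        return -EINTR;
    return io_result;
}

/* Hypothetical alternative: report -EINTR only when the I/O call itself
 * transferred nothing or failed, and keep a positive byte count otherwise. */
int result_keep_success(int io_result, int signal_pending)
{
    if (signal_pending && io_result <= 0)
        return -EINTR;
    return io_result;
}
```

For a call that moved 28 bytes while a signal was pending, the first policy reports -EINTR and the second reports 28. Whether the second is actually safe for gnbd I cannot say; it only shows what the question is about.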

I am not a kernel developer and I haven't spent a lot of time on the issue,
so it might make no sense at all.

Please let me know what you think about it.

On 6/21/06, David Teigland <teigland at redhat.com> wrote:
>
> On Fri, Jun 16, 2006 at 06:37:14PM +0300, Anton Kornev wrote:
> > gnbd (pid 5836: alogc.pl) got signal 9
> > gnbd0: Send control failed (result -4)
> > gnbd (pid 5836: alogc.pl) got signal 15
> > gnbd0: Send control failed (result -4)
>
> This and the fact that a number of processes appear to be blocked in the
> i/o path seem to point at gnbd as the hold-up.
>
> Dave
>
>
> >   51 D wait_on_buffer                   pdflush
> > 5771 D lock_page                        lock_dlm1
> > 5776 D -                                gfs_logd
> > 5777 D -                                gfs_quotad
> > 5778 D -                                gfs_inoded
> > 5892 D -                                httpd
> > 5895 D glock_wait_internal              httpd
> > 5896 D glock_wait_internal              httpd
> > 5897 D glock_wait_internal              httpd
> > 5911 D glock_wait_internal              httpd
> > 5915 D wait_on_buffer                   httpd
> > 5930 D wait_on_buffer                   sh
>
> > pdflush       D ffffffff8014aabc     0    51      6            53    50
> > (L-TLB)
> > 00000100dfc3dc78 0000000000000046 000001011bd3e980 000001010fc11f00
> >       0000000000000216 ffffffffa0042916 000001011aca60c0 0000000000000008
> >       000001011fdef7f0 0000000000000dfa
> > Call Trace:<ffffffffa0042916>{:dm_mod:dm_request+396}
> > <ffffffff8014aabc>{keventd_create_kthread+0}
> >       <ffffffff803053ef>{io_schedule+38}
> > <ffffffff80178c4c>{__wait_on_buffer+125}
> >       <ffffffff80178ad2>{bh_wake_function+0}
> > <ffffffff80178ad2>{bh_wake_function+0}
> >       <ffffffffa0235c5d>{:gfs:gfs_logbh_wait+49}
> > <ffffffffa024a6a6>{:gfs:disk_commit+794}
> >       <ffffffffa024a877>{:gfs:log_refund+111}
> > <ffffffffa024ad8e>{:gfs:log_flush_internal+510}
> >       <ffffffff8017d682>{sync_supers+167} <ffffffff8015e310>{wb_kupdate+36}
> >       <ffffffff8015edb4>{pdflush+323} <ffffffff8015e2ec>{wb_kupdate+0}
> >       <ffffffff8015ec71>{pdflush+0} <ffffffff8014aa93>{kthread+200}
> >       <ffffffff80110e17>{child_rip+8}
> > <ffffffff8014aabc>{keventd_create_kthread+0}
> >       <ffffffff8014a9cb>{kthread+0} <ffffffff80110e0f>{child_rip+0}
> > lock_dlm1     D 000001000c0096e0     0  5771      6          5772  5766
> > (L-TLB)
> > 0000010113ce3c58 0000000000000046 0000001000000000 0000010000000069
> >       000001011420b030 0000000000000069 000001000c00a940 000000010000eb10
> >       000001011a887030 0000000000001cae
> > Call Trace:<ffffffff802496d4>{__generic_unplug_device+19}
> > <ffffffff803053ef>{io_schedule+38}
> >       <ffffffff80159215>{__lock_page+191}
> > <ffffffff80158cfa>{page_wake_function+0}
> >       <ffffffff80158cfa>{page_wake_function+0}
> > <ffffffff80163125>{truncate_inode_pages+519}
> >       <ffffffffa0258f35>{:gfs:gfs_inval_page+63}
> > <ffffffffa02401b5>{:gfs:drop_bh+233}
> >       <ffffffffa0242138>{:gfs:gfs_glock_cb+194}
> > <ffffffffa02869dd>{:lock_dlm:dlm_async+1989}
> >       <ffffffff801333c8>{default_wake_function+0}
> > <ffffffff8014aabc>{keventd_create_kthread+0}
> >       <ffffffffa0286218>{:lock_dlm:dlm_async+0}
> > <ffffffff8014aabc>{keventd_create_kthread+0}
> >       <ffffffff8014aa93>{kthread+200} <ffffffff80110e17>{child_rip+8}
> >       <ffffffff8014aabc>{keventd_create_kthread+0}
> > <ffffffff8014a9cb>{kthread+0}
> >       <ffffffff80110e0f>{child_rip+0}
> > gfs_logd      D 0000000000000000     0  5776      1          5777  5775
> > (L-TLB)
> > 000001011387fe38 0000000000000046 0000000000000000 ffffffff80304a85
> >       000001011387fe58 ffffffff80304add ffffffff803cca80 0000000000000246
> >       00000101143fe030 00000000000000b5
> > Call Trace:<ffffffff80304a85>{thread_return+0}
> > <ffffffff80304add>{thread_return+88}
> >       <ffffffffa023e8d3>{:gfs:lock_on_glock+112}
> > <ffffffff8030565b>{__down_write+134}
> >       <ffffffffa0249cdb>{:gfs:gfs_ail_empty+56}
> > <ffffffffa0233930>{:gfs:gfs_logd+77}
> >       <ffffffff80110e17>{child_rip+8}
> > <ffffffff801cccff>{dummy_d_instantiate+0}
> >       <ffffffffa02338e3>{:gfs:gfs_logd+0} <ffffffff80110e0f>{child_rip+0}
> >
> > gfs_quotad    D 0000000000000000     0  5777      1          5778  5776
> > (L-TLB)
> > 0000010113881e98 0000000000000046 0000000000000000 ffffffff80304a85
> >       0000010113881eb8 ffffffff80304add 000001011ff87030 0000000100000074
> >       000001011430f7f0 0000000000000128
> > Call Trace:<ffffffff80304a85>{thread_return+0}
> > <ffffffff80304add>{thread_return+88}
> >       <ffffffff8030565b>{__down_write+134}
> > <ffffffffa025b55a>{:gfs:gfs_quota_sync+226}
> >       <ffffffffa0233ab1>{:gfs:gfs_quotad+127}
> > <ffffffff80110e17>{child_rip+8}
> >       <ffffffff801cccff>{dummy_d_instantiate+0}
> > <ffffffff801cccff>{dummy_d_instantiate+0}
> >       <ffffffff801cccff>{dummy_d_instantiate+0}
> > <ffffffffa0233a32>{:gfs:gfs_quotad+0}
> >       <ffffffff80110e0f>{child_rip+0}
> > gfs_inoded    D 0000000000000000     0  5778      1          5807  5777
> > (L-TLB)
> > 0000010113883e98 0000000000000046 000001011e2937f0 000001000c0096e0
> >       0000000000000000 ffffffff80304a85 0000010113883ec8 0000000180304add
> >       000001011e2937f0 00000000000000c2
> > Call Trace:<ffffffff80304a85>{thread_return+0}
> > <ffffffff8030565b>{__down_write+134}
> >       <ffffffffa026160d>{:gfs:unlinked_find+115}
> > <ffffffffa0261c6c>{:gfs:gfs_unlinked_dealloc+25}
> >       <ffffffffa0233bd5>{:gfs:gfs_inoded+66}
> > <ffffffff80110e17>{child_rip+8}
> >       <ffffffffa0233b93>{:gfs:gfs_inoded+0} <ffffffff80110e0f>{child_rip+0}
> >
> >
> > httpd         D ffffffff80304190     0  5892      1  5893          5826
> > (NOTLB)
> > 0000010111b75bf8 0000000000000002 0000000000000001 0000000000000001
> >       0000000000000000 0000000000000000 0000010114667980 0000000111b75bc0
> >       00000101143fe7f0 00000000000009ad
> > Call Trace:<ffffffff80303d6f>{__down+147}
> > <ffffffff801333c8>{default_wake_function+0}
> >       <ffffffff8015b3a2>{generic_file_write_nolock+158}
> > <ffffffff80305780>{__down_failed+53}
> >       <ffffffffa0236986>{:gfs:.text.lock.dio+95}
> > <ffffffffa0260e4c>{:gfs:gfs_trans_add_bh+205}
> >       <ffffffffa0253efc>{:gfs:do_write_buf+1138}
> > <ffffffffa0252db3>{:gfs:walk_vm+278}
> >       <ffffffffa0253a8a>{:gfs:do_write_buf+0}
> > <ffffffffa0253a8a>{:gfs:do_write_buf+0}
> >       <ffffffffa025415b>{:gfs:__gfs_write+201}
> > <ffffffff80177c60>{vfs_write+207}
> >       <ffffffff80177d48>{sys_write+69} <ffffffff801101c6>{system_call+126}
> >
> > httpd         D 0000010110ad7d48     0  5895   5892          5896  5893
> > (NOTLB)
> > 0000010110ad7bd8 0000000000000006 000001011b16e030 0000000000000075
> >       0000010117002030 0000000000000075 000001000c002940 0000000000000001
> >       00000101170027f0 000000000001300e
> > Call Trace:<ffffffff80131d1d>{try_to_wake_up+863}
> > <ffffffff80304cbd>{wait_for_completion+167}
> >       <ffffffff801333c8>{default_wake_function+0}
> > <ffffffff801333c8>{default_wake_function+0}
> >       <ffffffffa023f4b1>{:gfs:glock_wait_internal+350}
> > <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
> >       <ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20}
> > <ffffffffa0258b7b>{:gfs:gfs_private_nopage+84}
> >       <ffffffff80168211>{do_no_page+1003}
> > <ffffffff80167b13>{do_wp_page+948}
> >       <ffffffff8016858f>{handle_mm_fault+343}
> > <ffffffff80142a06>{get_signal_to_deliver+1118}
> >       <ffffffff801236d2>{do_page_fault+518}
> > <ffffffff80304a85>{thread_return+0}
> >       <ffffffff80304add>{thread_return+88} <ffffffff80110c61>{error_exit+0}
> >
> >
> > httpd         D 0000010110b5bd48     0  5896   5892          5897  5895
> > (NOTLB)
> > 0000010110b5bbd8 0000000000000002 00000101170027f0 0000000000000075
> >       00000101114787f0 0000000000000075 000001000c002940 0000000000000001
> >       0000010117002030 000000000000fb3e
> > Call Trace:<ffffffff80131d1d>{try_to_wake_up+863}
> > <ffffffff80304cbd>{wait_for_completion+167}
> >       <ffffffff801333c8>{default_wake_function+0}
> > <ffffffff801333c8>{default_wake_function+0}
> >       <ffffffffa023f4b1>{:gfs:glock_wait_internal+350}
> > <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
> >       <ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20}
> > <ffffffffa0258b7b>{:gfs:gfs_private_nopage+84}
> >       <ffffffff80168211>{do_no_page+1003}
> > <ffffffff80167b13>{do_wp_page+948}
> >       <ffffffff8016858f>{handle_mm_fault+343}
> > <ffffffff80142a06>{get_signal_to_deliver+1118}
> >       <ffffffff801236d2>{do_page_fault+518}
> > <ffffffff802a3445>{sys_accept+327}
> >       <ffffffff80182e88>{pipe_read+26} <ffffffff80110c61>{error_exit+0}
> >
> > httpd         D 0000000000000000     0  5897   5892          5911  5896
> > (NOTLB)
> > 0000010110119bd8 0000000000000006 0000010117002030 0000000000000075
> >       0000010117002030 0000000000000075 000001000c00a940 000000001b16e030
> >       00000101114787f0 000000000000fbe0
> > Call Trace:<ffffffff802496d4>{__generic_unplug_device+19}
> > <ffffffff80304cbd>{wait_for_completion+167}
> >       <ffffffff801333c8>{default_wake_function+0}
> > <ffffffff801333c8>{default_wake_function+0}
> >       <ffffffffa023f4b1>{:gfs:glock_wait_internal+350}
> > <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
> >       <ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20}
> > <ffffffffa0258b7b>{:gfs:gfs_private_nopage+84}
> >       <ffffffff80168211>{do_no_page+1003}
> > <ffffffff80167b13>{do_wp_page+948}
> >       <ffffffff8016858f>{handle_mm_fault+343}
> > <ffffffff80142a06>{get_signal_to_deliver+1118}
> >       <ffffffff801236d2>{do_page_fault+518}
> > <ffffffff80304a85>{thread_return+0}
> >       <ffffffff80304add>{thread_return+88} <ffffffff80110c61>{error_exit+0}
> >
> >
> > httpd         D 00000101100c3d48     0  5911   5892          5915  5897
> > (NOTLB)
> > 00000101100c3bd8 0000000000000002 000001011420b7f0 0000000000000075
> >       00000101170027f0 0000000000000075 000001000c002940 0000000000000000
> >       000001011b16e030 000000000000187e
> > Call Trace:<ffffffff80131d1d>{try_to_wake_up+863}
> > <ffffffff80304cbd>{wait_for_completion+167}
> >       <ffffffff801333c8>{default_wake_function+0}
> > <ffffffff801333c8>{default_wake_function+0}
> >       <ffffffffa023f4b1>{:gfs:glock_wait_internal+350}
> > <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
> >       <ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20}
> > <ffffffffa0258b7b>{:gfs:gfs_private_nopage+84}
> >       <ffffffff80168211>{do_no_page+1003}
> > <ffffffff80167b13>{do_wp_page+948}
> >       <ffffffff8016858f>{handle_mm_fault+343}
> > <ffffffff80142a06>{get_signal_to_deliver+1118}
> >       <ffffffff801236d2>{do_page_fault+518}
> > <ffffffff80304a85>{thread_return+0}
> >       <ffffffff80304add>{thread_return+88} <ffffffff80110c61>{error_exit+0}
> >
> >
> > httpd         D 0000000000006a36     0  5915   5892                5911
> > (NOTLB)
> > 00000101180f7ad8 0000000000000006 0000000000002706 ffffffffa020c791
> >       0000000000000000 0000000000000000 0000030348ac8c1c 0000000114a217f0
> >       0000010114c997f0 000000000000076a
> > Call Trace:<ffffffffa020c791>{:dlm:lkb_swqueue+43}
> > <ffffffff803053ef>{io_schedule+38}
> >       <ffffffff80178c4c>{__wait_on_buffer+125}
> > <ffffffff80178ad2>{bh_wake_function+0}
> >       <ffffffff80178ad2>{bh_wake_function+0}
> > <ffffffffa02352c6>{:gfs:gfs_dreread+154}
> >       <ffffffffa0235332>{:gfs:gfs_dread+40}
> > <ffffffffa02363b1>{:gfs:gfs_get_meta_buffer+201}
> >       <ffffffffa0242999>{:gfs:gfs_copyin_dinode+23}
> > <ffffffffa0242461>{:gfs:inode_go_lock+38}
> >       <ffffffffa023f586>{:gfs:glock_wait_internal+563}
> > <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
> >       <ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20}
> > <ffffffffa0258b7b>{:gfs:gfs_private_nopage+84}
> >       <ffffffff80168211>{do_no_page+1003}
> > <ffffffff80167b13>{do_wp_page+948}
> >       <ffffffff8016858f>{handle_mm_fault+343}
> > <ffffffff80142a06>{get_signal_to_deliver+1118}
> >       <ffffffff801236d2>{do_page_fault+518}
> > <ffffffff80304a85>{thread_return+0}
> >       <ffffffff80304add>{thread_return+88} <ffffffff80110c61>{error_exit+0}
> >
> >
> > sh            D 000000000000001a     0  5930   2547
> > (NOTLB)
> > 000001011090f8e8 0000000000000002 0000010111293d88 0000010110973d00
> >       0000010111293d88 0000000000000000 00000100dfc02400 0000000000010000
> >       00000101148557f0 0000000000002010
> > Call Trace:<ffffffff803053ef>{io_schedule+38}
> > <ffffffff80178c4c>{__wait_on_buffer+125}
> >       <ffffffff80178ad2>{bh_wake_function+0}
> > <ffffffff80178ad2>{bh_wake_function+0}
> >       <ffffffffa02352c6>{:gfs:gfs_dreread+154}
> > <ffffffffa0235332>{:gfs:gfs_dread+40}
> >       <ffffffffa02363b1>{:gfs:gfs_get_meta_buffer+201}
> > <ffffffffa0242999>{:gfs:gfs_copyin_dinode+23}
> >       <ffffffffa0242461>{:gfs:inode_go_lock+38}
> > <ffffffffa023f586>{:gfs:glock_wait_internal+563}
> >       <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
> > <ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20}
> >       <ffffffff801ccb78>{dummy_inode_permission+0}
> > <ffffffffa0257aca>{:gfs:gfs_permission+64}
> >       <ffffffff8018d475>{dput+56} <ffffffff80183d32>{permission+51}
> >       <ffffffff801844aa>{__link_path_walk+372}
> > <ffffffff801851c2>{link_path_walk+82}
> >       <ffffffff8012370b>{do_page_fault+575}
> > <ffffffff801849b0>{__link_path_walk+1658}
> >       <ffffffff801851c2>{link_path_walk+82}
> > <ffffffff8012370b>{do_page_fault+575}
> >       <ffffffff8018540f>{path_lookup+451}
> > <ffffffff801856bb>{__user_walk+47}
> >       <ffffffff8017ff1a>{vfs_stat+24} <ffffffff8012370b>{do_page_fault+575}
> >       <ffffffff80180264>{sys_newstat+17} <ffffffff80110c61>{error_exit+0}
> >       <ffffffff801101c6>{system_call+126}
>
>


-- 
Best Regards,
Anton Kornev.