[Linux-cluster] GFS hangs, nodes die
Sebastian Walter
sebastian.walter at fu-berlin.de
Wed Aug 22 08:06:21 UTC 2007
Hi Marc,
yesterday the same problem arose again, and I could observe the
counters. Btw, I'm using the newest version of RHCS/GFS (GFS-kernel-smp
2.6.9-72.2, GFS 6.1.14-0, rgmanager 1.9.68-1, cman-kernel-smp
2.6.9-50.2, cman 1.0.17-0). On one node I have 8GB of RAM, on the others
4GB. The lock counts didn't change over time.
Thanks!
Sebastian
gfs_tool counters /global/home
locks 2041
locks held 28
freeze count 0
incore inodes 20
metadata buffers 2
unlinked inodes 0
quota IDs 0
incore log buffers 0
log space used 0.10%
meta header cache entries 1
glock dependencies 1
glocks on reclaim list 0
log wraps 0
outstanding LM calls 65
outstanding BIO calls 0
fh2dentry misses 0
glocks reclaimed 386
glock nq calls 214090
glock dq calls 214002
glock prefetch calls 148
lm_lock calls 364
lm_unlock calls 234
lm callbacks 593
address operations 0
dentry operations 46654
export operations 0
file operations 90629
inode operations 94213
super operations 173031
vm operations 0
block I/O reads 366
block I/O writes 292
ps axwwww | sort -k4 -n | tail -10
6771 ? S 0:00 [gfs_quotad]
6772 ? S 0:00 [gfs_inoded]
30527 ? Ss 0:00 sshd: root at pts/0
30529 pts/0 Ds+ 0:00 -bash
17499 ? Ss 1:15 /usr/sbin/gmond
3796 ? Sl 2:32 /usr/sbin/gmetad
4251 ? Sl 2:17 /opt/gridengine/bin/lx26-amd64/sge_qmaster
4270 ? Sl 5:33 /opt/gridengine/bin/lx26-amd64/sge_schedd
3606 ? Ss 14:50 /opt/rocks/bin/python /opt/rocks/bin/greceptor
1802 ? R 357:43 df -hP
cat /proc/cluster/services
Service Name GID LID State Code
Fence Domain: "default" 5 2 recover 4 -
[3 2 1 11 5 9 6 10 7 8]
DLM Lock Space: "clvmd" 7 3 recover 0 -
[3 2 1 11 5 9 6 10 7 8]
DLM Lock Space: "Magma" 17 5 recover 0 -
[3 2 1 11 5 9 6 10 7 8]
DLM Lock Space: "homeneu" 19 6 recover 0 -
[3 2 1 11 5 9 6 10 7 8]
GFS Mount Group: "homeneu" 21 7 recover 0 -
[3 2 1 11 5 9 6 10 7 8]
User: "usrm::manager" 16 4 recover 0 -
[3 2 1 11 5 9 6 10 7 8]
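
For reference, a minimal sketch of the periodic capture mentioned further
down ("mail myself the output of the counters every 10 minutes"); the
script path, log file and cron schedule below are only placeholders, not
part of the actual setup:

#!/bin/sh
# Append the GFS lock counters and the ten busiest processes to a log
# file on every run, so lock growth can be correlated with load.
# /global/home is the mount point from this thread; the log path is
# just a placeholder.
MNT=/global/home
LOG=/var/log/gfs-counters.log

{
    date
    gfs_tool counters "$MNT"
    ps axwwww | sort -k4 -n | tail -10
    echo
} >> "$LOG" 2>&1

Run from cron every 10 minutes, e.g. in /etc/crontab (script location is
hypothetical):

*/10 * * * * root /usr/local/sbin/gfs-counters.sh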
Marc Grimme wrote:
> On Tuesday 21 August 2007 09:52:32 Sebastian Walter wrote:
>
>> Hi,
>>
>> Marc Grimme wrote:
>>
>>> Do you also see some messages on the console of the nodes? And the
>>> gfs_tool counters would help before that problem occurs, so let it run a
>>> few times beforehand to see if the locks increase.
>>> What kind of stress tests are you doing? I bet searching the whole
>>> filesystem. What makes me wonder is that gfs_tool glock_purge does not
>>> work, whereas it worked for me with exactly the same problems. Did you
>>> set it _AFTER_ the fs was mounted?
>>>
> Sorry, I meant that after is right and before is not ;-(.
> And are you using the latest version of CS/GFS?
> Do you have a lot of memory in your machines 16G or more?
>
>> That makes me optimistic. I set it after the volume was mounted, so I
>> will give it another try, setting it before mounting. Then I will also
>> mail myself the output of the counters every 10 minutes. Let's see...
>>
> I would be interested in the counters.
> Also add the process list in order to see how much CPU time gfs_scand
> consumes.
> i.e.
> ps axwwww | sort -k4 -n | tail -10
>
> Have fun Marc.
>
>> ...with best thanks
>> Sebastian
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
>
>
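
For completeness, the glock_purge tunable discussed above is set per mount
point and does not persist across mounts, so it has to be re-applied after
each mount. A minimal sketch, assuming the tunable is available in this GFS
release (the 50% value is only an example):

# show the current tunables for the mount point
gfs_tool gettune /global/home

# ask gfs_scand to trim up to 50% of unused glocks on each pass;
# re-run this after every mount, e.g. from an init script
gfs_tool settune /global/home glock_purge 50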