[Linux-cluster] GFS hangs, nodes die
Sebastian Walter
sebastian.walter at fu-berlin.de
Wed Aug 22 08:06:21 UTC 2007
Hi Marc,
yesterday the same problem arose again, and I could observe the
counters. Btw, I'm using the newest version of RHCS/GFS (GFS-kernel-smp
2.6.9-72.2, GFS 6.1.14-0, rgmanager 1.9.68-1, cman-kernel-smp
2.6.9-50.2, cman 1.0.17-0). On one node I have 8GB of RAM, on the others
4GB. The lock counts didn't change over time.
Thanks!
Sebastian
gfs_tool counters /global/home
locks 2041
locks held 28
freeze count 0
incore inodes 20
metadata buffers 2
unlinked inodes 0
quota IDs 0
incore log buffers 0
log space used 0.10%
meta header cache entries 1
glock dependencies 1
glocks on reclaim list 0
log wraps 0
outstanding LM calls 65
outstanding BIO calls 0
fh2dentry misses 0
glocks reclaimed 386
glock nq calls 214090
glock dq calls 214002
glock prefetch calls 148
lm_lock calls 364
lm_unlock calls 234
lm callbacks 593
address operations 0
dentry operations 46654
export operations 0
file operations 90629
inode operations 94213
super operations 173031
vm operations 0
block I/O reads 366
block I/O writes 292
ps axwwww | sort -k4 -n | tail -10
6771 ? S 0:00 [gfs_quotad]
6772 ? S 0:00 [gfs_inoded]
30527 ? Ss 0:00 sshd: root at pts/0
30529 pts/0 Ds+ 0:00 -bash
17499 ? Ss 1:15 /usr/sbin/gmond
3796 ? Sl 2:32 /usr/sbin/gmetad
4251 ? Sl 2:17 /opt/gridengine/bin/lx26-amd64/sge_qmaster
4270 ? Sl 5:33 /opt/gridengine/bin/lx26-amd64/sge_schedd
3606 ? Ss 14:50 /opt/rocks/bin/python /opt/rocks/bin/greceptor
1802 ? R 357:43 df -hP
cat /proc/cluster/services
Service Name GID LID State Code
Fence Domain: "default" 5 2 recover 4 -
[3 2 1 11 5 9 6 10 7 8]
DLM Lock Space: "clvmd" 7 3 recover 0 -
[3 2 1 11 5 9 6 10 7 8]
DLM Lock Space: "Magma" 17 5 recover 0 -
[3 2 1 11 5 9 6 10 7 8]
DLM Lock Space: "homeneu" 19 6 recover 0 -
[3 2 1 11 5 9 6 10 7 8]
GFS Mount Group: "homeneu" 21 7 recover 0 -
[3 2 1 11 5 9 6 10 7 8]
User: "usrm::manager" 16 4 recover 0 -
[3 2 1 11 5 9 6 10 7 8]
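
For reference, a minimal sketch of the periodic capture mentioned further
down ("mail myself the output of the counters every 10 minutes"); the
script path, log file and cron schedule below are only placeholders, not
part of the actual setup:

#!/bin/sh
# Append the GFS lock counters and the ten busiest processes to a log
# file on every run, so lock growth can be correlated with load.
# /global/home is the mount point from this thread; the log path is
# just a placeholder.
MNT=/global/home
LOG=/var/log/gfs-counters.log

{
    date
    gfs_tool counters "$MNT"
    ps axwwww | sort -k4 -n | tail -10
    echo
} >> "$LOG" 2>&1

Run from cron every 10 minutes, e.g. in /etc/crontab (script location is
hypothetical):

*/10 * * * * root /usr/local/sbin/gfs-counters.sh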
Marc Grimme wrote:
> On Tuesday 21 August 2007 09:52:32 Sebastian Walter wrote:
>
>> Hi,
>>
>> Marc Grimme wrote:
>>
>>> Do you also see some messages on the console of the nodes? And the
>>> gfs_tool counters would help before that problem occurs, so let it run a
>>> few times beforehand to see if the locks increase.
>>> What kind of stress tests are you doing? I bet searching the whole
>>> filesystem. What makes me wonder is that gfs_tool glock_purge does not
>>> work, whereas it worked for me with exactly the same problems. Did you
>>> set it _AFTER_ the fs was mounted?
>>>
> Sorry, I meant that after is right and before is not ;-(.
> And are you using the latest version of CS/GFS?
> Do you have a lot of memory in your machines 16G or more?
>
>> That makes me optimistic. I set it after the volume was mounted, so I
>> will give it another try, setting it before mounting. Then I will also
>> mail myself the output of the counters every 10 minutes. Let's see...
>>
> I would be interested in the counters.
> Also add the process list in order to see how much CPU time gfs_scand
> consumes.
> i.e.
> ps axwwww | sort -k4 -n | tail -10
>
> Have fun Marc.
>
>> ...with best thanks
>> Sebastian
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
>
>
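
For completeness, the glock_purge tunable discussed above is set per mount
point and does not persist across mounts, so it has to be re-applied after
each mount. A minimal sketch, assuming the tunable is available in this GFS
release (the 50% value is only an example):

# show the current tunables for the mount point
gfs_tool gettune /global/home

# ask gfs_scand to trim up to 50% of unused glocks on each pass;
# re-run this after every mount, e.g. from an init script
gfs_tool settune /global/home glock_purge 50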