[Linux-cluster] GFS hangs, nodes die

Mon Aug 20 07:42:03 UTC 2007

> [...cut...]
> certain level (heavy I/O), it collapses. About 20% of the nodes do crash
> (not reacting any more, but no sign of kernel panic), the others can't
> access the gfs resource.
> [...cut...]
> 
> [root at compute-0-6 ~]# cat /proc/cluster/services
> Service          Name                              GID LID State     Code
> Fence Domain:    "default"                           3   2 recover 4 -
> [1 2 6 10 9 8 3 7 4 11]
> DLM Lock Space:  "clvmd"                             7   3 recover 0 -
> [1 2 6 10 9 8 3 7 4 11]
> DLM Lock Space:  "Magma"                            12   5 recover 0 -
> [1 2 6 10 9 8 3 7 4 11]
> DLM Lock Space:  "homeneu"                          17   6 recover 0 -
> [10 9 8 7 2 3 6 4 1 11]
> GFS Mount Group: "homeneu"                          18   7 recover 0 -
> [10 9 8 7 2 3 6 4 1 11]
> User:            "usrm::manager"                    11   4 recover 0 -
> [1 2 6 10 9 8 3 7 4 11]

Hello,

1. Do you have Fibre Channel SAN storage, or do you use GNBD/iSCSI?
2. The other nodes can't access the GFS filesystem because the cluster is
in the recover state, and recovery cannot finish until the failed nodes
have been fenced. Is fencing properly configured?
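To see at a glance which services are stuck, the service table can be parsed. This is a minimal Python sketch. It assumes the column layout shown in the quoted output above (service type, quoted name, GID, LID, state) and assumes that a healthy service reports the state "run". The helpers parse_services and not_running are hypothetical, not part of the cluster tools.

```python
import re
from typing import List, NamedTuple

# Matches one service line of /proc/cluster/services in the layout quoted
# above, e.g.:  DLM Lock Space:  "clvmd"   7   3 recover 0 -
# Header lines and "[1 2 6 ...]" member lines contain no "type:" prefix,
# so they do not match and are skipped.
SERVICE_RE = re.compile(
    r'^(?P<kind>[A-Za-z ]+?):\s+"(?P<name>[^"]*)"\s+'
    r'(?P<gid>\d+)\s+(?P<lid>\d+)\s+(?P<state>\S+)'
)


class Service(NamedTuple):
    kind: str
    name: str
    gid: int
    lid: int
    state: str


def parse_services(text: str) -> List[Service]:
    """Return one Service per service line found in the text."""
    services = []
    for raw in text.splitlines():
        # Drop e-mail quoting ("> ") so quoted output can be pasted directly.
        line = raw.lstrip("> ").rstrip()
        m = SERVICE_RE.match(line)
        if m:
            services.append(Service(m["kind"], m["name"],
                                    int(m["gid"]), int(m["lid"]),
                                    m["state"]))
    return services


def not_running(services: List[Service]) -> List[Service]:
    # Assumption: a healthy service reports the state "run".
    return [s for s in services if s.state != "run"]
```

On a node it could be called as not_running(parse_services(open("/proc/cluster/services").read())). For the output quoted above, every service would be reported, because all of them are in recover.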

Best Regards
Maciej Bogucki