[Linux-cluster] GFS hangs after several hours

Brynnen R Owen owen at isrl.uiuc.edu
Fri Nov 12 16:06:49 UTC 2004


More information.

I may have been running an older version of ccsd, which let me get the
cluster running in the first place.  I can't get that far now.

I have IPv6 compiled in the kernel but no IPv6 interfaces defined.
I've given ccsd the -4 flag.

Checking logs after "ccs_test connect" shows that ccsd does not
believe the cluster is quorate.

/etc/cluster/status says that the cluster has reached quorum.  The IP
addresses are appropriate (I have dual-NIC hosts).

I recompiled ccsd with "DEBUG=1" and found that the "quorate" variable
is never set in ccsd.  I further found that cluster_communicator()
never receives a valid fd from clu_connect() and is therefore stuck in
a loop.  clu_connect() appears to be a magma call.
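
To illustrate the pattern (this is a minimal sketch, not the ccsd
source): a connect loop that never gets a valid fd spins forever unless
the attempts are bounded.  Every name below is hypothetical.  In
particular, connect_fn and stub_connect() only stand in for the real
call and do not reflect the magma clu_connect() prototype.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical connect callback: returns a descriptor >= 0 on success,
 * a negative value on failure.  Stand-in only; NOT the magma API. */
typedef int (*connect_fn)(void *ctx);

/* Try to connect at most max_attempts times.  Returns the fd on
 * success, or -1 once the attempts are exhausted, so the caller can
 * report the failure instead of looping silently. */
int connect_with_limit(connect_fn try_connect, void *ctx, int max_attempts)
{
    assert(try_connect != NULL && max_attempts > 0);

    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        int fd = try_connect(ctx);
        if (fd >= 0)
            return fd;
        fprintf(stderr, "connect attempt %d/%d failed (returned %d)\n",
                attempt, max_attempts, fd);
        /* A real daemon would sleep here before retrying. */
    }
    return -1;
}

/* Demonstration stub: fails until the counter behind ctx reaches zero,
 * then returns an arbitrary fake descriptor. */
int stub_connect(void *ctx)
{
    int *failures_left = (int *)ctx;
    if (*failures_left > 0) {
        (*failures_left)--;
        return -1;
    }
    return 3;
}
```

Calling connect_with_limit(stub_connect, &n, 5) with n set above 5
mimics what I am seeing (no valid fd ever comes back), except that the
caller gets -1 and a log line per attempt rather than a silent hang.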

Any advice on how to proceed?

On Thu, Nov 11, 2004 at 12:07:18PM -0600, Brynnen R Owen wrote:
> Hi all,
> 
> My setup:
> 
> 5 Athlon servers
> 
> Red Hat Linux 9 (yeah, I haven't upgraded yet)
> 
> kernel-2.6.9 from kernel.org, patched with gfs/ccs/dlm from the
> .tar.gz repository.
> 
> using lock_dlm
> 
> Using Apple XServe RAIDs with Apple FC cards (mptscsih driver).
> 
>   I thought I had everything running properly.  I had two machines
> hammering a GFS partition at the same time.  I pulled the power cord
> on one.  fence_vixel kicked in, and the rest of the cluster
> continued.  I could repeat this over and over.
> 
>   I set up two machines, each writing to a different GFS filesystem
> overnight.  In the morning, there were no errors, but one process was
> hung in a "D" state.  The fence system did not show any activity.  No
> errors were logged anywhere on the cluster.  'df' hung on any machine
> in the cluster when it reached one of the GFS partitions.  I shut down
> the Ethernet interface on one of the machines, but it didn't get
> fenced.  It seems that something silently died, but I don't really
> know where to begin looking, as I don't see any errors written
> anywhere.  Anyone got any ideas?
> 
>   The only other note is that ccsd appeared to be having some problems
> determining whether the cluster had quorum.
> 

-- 
<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
<>  Brynnen Owen            (     this space for rent                      )<>
<>  owen at uiuc.edu           (                                              )<>
<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>



