[Linux-cluster] data and machine dependent NFS GFS file xfer problem

Robert Peterson rpeterso at redhat.com
Fri Feb 23 16:15:28 UTC 2007

Keith Lewis wrote:
> Thursday 22 Feb 2007
> 	An attempt was made to make sure all the computers in a certain group
> had a common set of rpms installed.
> 	To make this easier, a non-RedHat rpm was copied to a disk that was
> mounted on most of the machines, and installed from there.  This broke the RPM
> database on those machines.  After the install, most rpm commands got this:
> error: rpmdbNextIterator: skipping h# 2325 Header V3 DSA signature: BAD, key
> ID 99b62126
> 	This was fixed by rpm -e <the above rpm> ; rpm --rebuilddb
> 	It was noticed that all the machines which had installed the shared
> rpm had failed in this way, but none of the machines that had installed from a
> copy on local disk.
> 	Using `sum' it was noticed that all the machines except one saw the
> file as corrupt.  The one machine, (called `T' from here on), was the one
> which had done the original copy.  It still saw the file as pristine - i.e.
> not corrupt.
> 	The shared filesystem is based on GFS, but due to a history of network
> and SAN problems causing fence events which seriously degrade our
> service level, GFS is restricted to as few machines as possible.  Currently
> only three machines, (called C, S and W from here on), mount the GFS disk
> directly.  Machines C and W export it to the rest of the group via NFS.
> 	`T' mounted the GFS disk via NFS through W.  `T' was the only machine
> to see the GFS copy as pristine.  All other machines, including C, S and W,
> irrespective of whether they mounted the disk by GFS directly or by NFS, saw
> the file as corrupt.
> 	`T' then dismounted the disk via W and remounted it via C.  It then saw
> the file as corrupt, but it then made another copy of the file from its local
> disk to the GFS disk, and this copy too was seen as corrupt by all other
> machines, while `T' itself saw it as pristine.
> 	Other machines had no problems copying the same file from their local
> disk to the GFS disk.
> 	An attempt was made to mount the GFS disk directly on T:
> /etc/init.d/pool start
> /etc/init.d/ccsd start
> /etc/init.d/lock_gulmd start
> /etc/init.d/gfs start
> mount /dev/pool/pool_gfs01 -t gfs /mnt
> 	(I've never mounted a GFS disk in this way before, so this may be a
> problem - usually it's in fstab and `/etc/init.d/gfs start' mounts it)
> 	The mount never completed.  The log on the master lockserver showed
> lock_gulm starting on `T' (New Client: idx 10 fd 15 from ...) and about a
> minute later T missed a heartbeat...  seven heartbeats later `T' was fenced,
> and most embarrassingly, rebooted.
> 	After the reboot `T' saw all the GFS copies (except those made by other
> machines) as corrupt, but a further copy of the file by `T' to the GFS disk
> showed as corrupt by all nodes except `T' which continued to see it as
> pristine...  i.e. the reboot had not cured the problem...
> 	Summary - I have one file, R-2.3.1-1.rh3AS.i386.rpm, which one node,
> `T', cannot successfully copy to the GFS disk, although it thinks it can, and
> can even copy it back, producing a duplicate of the original...
> # uname -r
> 2.4.21-47.0.1.ELsmp
> # rpm -qa | grep -i gfs
> GFS-devel-
> GFS-
> GFS-modules-smp-
> # cat /etc/redhat-release
> Red Hat Enterprise Linux AS release 3 (Taroon Update 8)
> sum pristine file:
> 01904 22905
> sum corrupt file:
> 57604 22905
> 	The above account is an accurate description of the events; only the
> confusion, disbelief and utter panic have been omitted.
> 	Looking for suggestions, like what to do next, which list to take
> it to and so on...
> 	Thanks
> Keith
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
Hi Keith,

Good question.  In fact, I've answered similar ones before on this list.
I thought I had added it to the cluster faq, but apparently I was 
remiss; sorry.
I just added it now:


The examples I gave assume that you're using lvm2, which you're not
because you're on RHEL3, but they should still give you the gist.
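
For reference, the lvm2-based flow looks roughly like this — a minimal
sketch only, with a made-up clustered volume group `vg_cluster` and
logical volume `lv_gfs01` (on RHEL3 the `pool` commands you listed play
the equivalent role):

```shell
# Hedged sketch, not a drop-in recipe: the volume group, logical volume,
# and mount point names below are hypothetical.
vgchange -aly vg_cluster                        # activate the clustered volume group locally
mount -t gfs /dev/vg_cluster/lv_gfs01 /mnt/gfs  # mount the GFS filesystem on it
```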

Please let me know if the new FAQ entry needs some work.
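
As an aside, the 16-bit `sum` checksum is fairly weak; `cksum` (or
`md5sum`, where available) gives a stronger fingerprint.  Here's a toy
sketch of the kind of comparison you did — the file names and contents
are invented, and in practice each copy would be read from a different
node:

```shell
# Toy demonstration: two copies of "the same" file with differing contents,
# compared by CRC the way the per-node checksum comparison was done above.
tmp=$(mktemp -d)
printf 'pristine rpm payload\n'  > "$tmp/copy_seen_by_T.rpm"
printf 'corrupted rpm payload\n' > "$tmp/copy_seen_by_C.rpm"   # simulated bad copy
a=$(cksum "$tmp/copy_seen_by_T.rpm" | awk '{print $1}')        # CRC of first copy
b=$(cksum "$tmp/copy_seen_by_C.rpm" | awk '{print $1}')        # CRC of second copy
if [ "$a" = "$b" ]; then
    echo "copies match"
else
    echo "copies differ"    # the result every node except T reported
fi
rm -rf "$tmp"
```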

BTW, it was noticed that almost all of your sentences were written
in the passive voice.  The question why presents itself.  ;)


Bob Peterson
Red Hat Cluster Suite
