[Linux-cluster] data and machine dependent NFS GFS file xfer problem

Fri Feb 23 01:52:51 UTC 2007

Thursday 22 Feb 2007

	An attempt was made to make sure all the computers in a certain group
had a common set of rpms installed.

	To make this easier, a non-RedHat rpm was copied to a disk that was
mounted on most of the machines, and installed from there.  This broke the RPM
database on those machines.  After the install, most rpm commands produced
this error:

error: rpmdbNextIterator: skipping h# 2325 Header V3 DSA signature: BAD, key ID 99b62126

	This was fixed by rpm -e <the above rpm> ; rpm --rebuilddb

	It was noticed that all the machines which had installed from the shared
rpm had failed in this way, but none of the machines that had installed from a
copy on local disk had.

	Using `sum' it was noticed that all the machines except one saw the
file as corrupt.  The one machine (called `T' from here on) was the one
which had done the original copy.  It still saw the file as pristine, i.e.
not corrupt.
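
	The comparison described above can be sketched as a small shell
function.  This is only an illustration, not what was actually run: the name
`same_sum' and the paths in the usage comment are made up.  It reads each
file on stdin so that `sum' prints just the checksum and block count.

```shell
# Hypothetical helper: compare the `sum' output of two copies of a file.
# Returns 0 if checksum and block count match, 1 if they differ,
# 2 if either file could not be read.
same_sum() {
    a=$(sum < "$1") || return 2
    b=$(sum < "$2") || return 2
    [ "$a" = "$b" ]
}

# Example usage (placeholder paths, not the real ones):
# same_sum /local/copy.rpm /gfs/copy.rpm && echo pristine || echo corrupt
```

	Running the same check on each node against the same pair of files
would show which nodes disagree about the shared copy.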

	The shared filesystem is based on GFS, but due to a history of network
and SAN problems causing fence events which seriously degrade our
service level, GFS is restricted to as few machines as possible.  Currently
only three machines (called C, S and W from here on) mount the GFS disk
directly.  Machines C and W export it to the rest of the group via NFS.

	`T' mounted the GFS disk via NFS through W.  `T' was the only machine
to see the GFS copy as pristine.  All other machines, including C, S and W,
irrespective of whether they mounted the disk by GFS directly or by NFS saw
the file as corrupt.

	`T' then unmounted the disk from W and remounted it via C.  It then saw
the existing GFS copy as corrupt.  However, when it made another copy of the
file from its local disk to the GFS disk, this copy too was seen as corrupt
by all other machines, while `T' itself saw it as pristine.

	Other machines had no problems copying the same file from their local
disk to the GFS disk.

	An attempt was made to mount the GFS disk directly on T:

/etc/init.d/pool start
/etc/init.d/ccsd start
/etc/init.d/lock_gulmd start
/etc/init.d/gfs start
mount /dev/pool/pool_gfs01 -t gfs /mnt

	(I've never mounted a GFS disk in this way before, so this may be a
problem; usually it's in fstab and `/etc/init.d/gfs start' mounts it.)

	The mount never completed.  The log on the master lockserver showed
lock_gulm starting on `T' (New Client: idx 10 fd 15 from ...) and about a
minute later T missed a heartbeat...  seven heartbeats later `T' was fenced,
and most embarrassingly, rebooted.

	After the reboot `T' saw all the GFS copies (except those made by other
machines) as corrupt, but a further copy of the file by `T' to the GFS disk
was seen as corrupt by all nodes except `T', which continued to see it as
pristine, i.e. the reboot had not cured the problem.

	Summary - I have one file, R-2.3.1-1.rh3AS.i386.rpm, which one node,
`T', cannot successfully copy to the GFS disk, although it thinks it can, and
can even copy it back, producing a duplicate of the original...

# uname -r
2.4.21-47.0.1.ELsmp

# rpm -qa | grep -i gfs
GFS-devel-6.0.2.36-1
GFS-6.0.2.36-1
GFS-modules-smp-6.0.2.36-1

# cat /etc/redhat-release
Red Hat Enterprise Linux AS release 3 (Taroon Update 8)

sum pristine file:
01904 22905

sum corrupt file:
57604 22905
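
	The block counts above are identical while the checksums differ, so
the two copies are the same size to within one block and differ in content.
A possible next step, sketched below, is to count the differing bytes with
`cmp -l'.  The function name `diff_bytes' is made up, and what it would
report for this file is not known.

```shell
# Hypothetical helper: report how many bytes differ between two copies
# and the offset (1-based, as printed by cmp -l) of the first difference.
# Only bytes up to the length of the shorter file are compared.
diff_bytes() {
    cmp -l "$1" "$2" 2>/dev/null | awk '
        NR == 1 { first = $1 }
        END { printf "%d differing bytes, first at offset %s\n", NR, (NR ? first : "-") }'
}

# Example usage (placeholder paths, not the real ones):
# diff_bytes /local/copy.rpm /gfs/copy.rpm
```

	A handful of scattered bytes would point in a different direction than
whole blocks of differing data, so the count and first offset may help
narrow down where to look.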

	The above account is an accurate description of the events; only the
confusion, disbelief and utter panic have been omitted.

	Looking for suggestions, like what to do next, which list to take
it to and so on...

	Thanks

Keith