[Linux-cluster] data and machine dependent NFS GFS file xfer problem

Keith Lewis Keith.Lewis at its.monash.edu.au
Mon Feb 26 09:19:00 UTC 2007


I wrote:

> Thursday 22 Feb 2007
> 	An attempt was made to make sure all the computers in a certain group
> had a common set of rpms installed.
> ...
> 	Summary - I have one file, R-2.3.1-1.rh3AS.i386.rpm, which one node,
> `T', cannot successfully copy to the GFS disk, although it thinks it can, and
> can even copy it back, producing a duplicate of the original...
> ...
> 	Looking for suggestions, like what to do next, which list to take
> it to and so on...

	Thanks to all who replied.

	Yes it was hardware.

	No it had nothing to do with GFS.

	Yes I'm an idiot.

	No I don't always use passive voice...

	We have since discovered that the 4 machines in a group we call
subnet 12 could not communicate properly with another group in what we call
subnet 13.  Both subnets are behind a CSM.  In particular, if a UDP fragment
happened to consist solely of ones in the data area, the 4th (16 bit) word
would mysteriously get rewritten to zeros.  With `tcpdump' we could see good
data flowing out of machine `T' and bad data entering machines `W' and `C'.
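
	For anyone who wants to check their own network for something similar,
here is a rough sketch of a probe, not the procedure we actually used (we only
used `tcpdump').  It sends a UDP datagram whose payload is all ones and reports
which 16 bit words arrive changed.  The function names, the port number and the
payload size are all made up for illustration, and it counts words from the
start of the UDP payload, which may not line up with where the corruption sat
inside a fragment.  A payload this small will not be fragmented, so a larger
size may be needed to trigger the same behaviour.

```python
import socket


def all_ones_payload(size=1024):
    """Build a payload in which every bit is set (hypothetical probe data)."""
    return b"\xff" * size


def corrupted_words(payload):
    """Return 0-based indices of 16 bit words that are not 0xFFFF.

    A trailing odd byte, if any, is ignored.
    """
    return [
        i // 2
        for i in range(0, len(payload) - 1, 2)
        if payload[i:i + 2] != b"\xff\xff"
    ]


def send_probe(host, port, size=1024):
    """Send one all-ones UDP datagram to host:port (run on the sender)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(all_ones_payload(size), (host, port))


def receive_probe(port, timeout=5.0, bind_addr="0.0.0.0"):
    """Wait for one datagram and list its damaged words (run on the receiver)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind((bind_addr, port))
        s.settimeout(timeout)
        data, _ = s.recvfrom(65535)
    return corrupted_words(data)
```

	Run `receive_probe(9999)' on a machine in one subnet, then
`send_probe("<receiver>", 9999)' on a machine in the other.  An empty list
means the datagram arrived intact.  A list containing 3 would mean the 4th
word was changed.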

	(We are guessing that this started happening a few weeks ago when
various routers and the CSM were upgraded for security reasons)...

	(It had to be UDP.  TCP and ICMP packets did not trigger the problem).

	(btw this also explains the hang/fence/reboot that I mentioned in the
original mail - the one-to-zero corruption caused the sender to retry
continuously, making the machine too busy to do heartbeats).

	Thanks again.

Keith
