[Linux-cluster] GFS1: node get withdrawn intermittent

Thu Feb 8 18:16:12 UTC 2007

On Thu, Feb 08, 2007 at 10:02:50AM -0800, Sridharan Ramaswamy (srramasw) wrote:
> Interesting. While testing GFS with a small journal size and resource group
> size, I hit the same issue:
> 
> 
> Feb  7 17:01:42 cfs1 kernel: GFS: fsid=cisco:gfs2.2: fatal: assertion "x <= length" failed
> Feb  7 17:01:42 cfs1 kernel: GFS: fsid=cisco:gfs2.2:   function = blkalloc_internal
> Feb  7 17:01:42 cfs1 kernel: GFS: fsid=cisco:gfs2.2:   file = /download/gfs/cluster.cvs-rhel4/gfs-kernel/src/gfs/rgrp.c, line = 1458
> Feb  7 17:01:42 cfs1 kernel: GFS: fsid=cisco:gfs2.2:   time = 1170896502
> Feb  7 17:01:42 cfs1 kernel: GFS: fsid=cisco:gfs2.2: about to withdraw from the cluster
> Feb  7 17:01:42 cfs1 kernel: GFS: fsid=cisco:gfs2.2: waiting for outstanding I/O
> Feb  7 17:01:42 cfs1 kernel: GFS: fsid=cisco:gfs2.2: telling LM to withdraw
> 
> 
> This happened on a 3-node GFS cluster over a 512M device.
> 
> $ gfs_mkfs -t cisco:gfs2 -p lock_dlm -j 3 -J 8 -r 16 -X /dev/hda12
> 
> I was using bonnie++ to create about 10K files of 1K each from each of the 3
> nodes simultaneously.
> 
> Looking at the code in rgrp.c, it seems related to a failure to find a
> particular resource group block. Could this be due to the very low RG size
> I'm using (16M)?

This is bz 215793 which has been around for quite a while and has been
very difficult for us to reproduce.  Perhaps using a smaller rg size is a
way to reproduce the bug more easily.
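
For anyone who wants to try this, here is a minimal sketch of the small-file
workload described above (about 10K files of 1K each per node). It is only a
rough stand-in for bonnie++, not the tool itself, and nothing here is confirmed
to trigger the assertion. The function name create_small_files, the per-host
subdirectory, and the mount point argument are all assumptions made for
illustration. Run it at the same time on every node against the same GFS mount.

```python
import os
import socket
import sys
import tempfile


def create_small_files(directory, count=10_000, size=1024):
    """Create `count` files of `size` bytes each under `directory`.

    Hypothetical helper: approximates a many-small-files workload so that
    block allocation is exercised heavily. Returns the number of files written.
    """
    os.makedirs(directory, exist_ok=True)
    payload = b"x" * size
    for i in range(count):
        path = os.path.join(directory, f"file{i:05d}")
        with open(path, "wb") as f:
            f.write(payload)
    return count


if __name__ == "__main__":
    # First argument: the GFS mount point (assumed); falls back to a temp dir
    # so the script can be tried safely on a local filesystem first.
    mount_point = sys.argv[1] if len(sys.argv) > 1 else tempfile.mkdtemp()
    # One subdirectory per node so the nodes do not overwrite each other.
    target = os.path.join(mount_point, socket.gethostname())
    written = create_small_files(target)
    print(f"wrote {written} files under {target}")
```

Using a separate directory per node is a choice made here to keep the runs
from colliding; the original bonnie++ runs may have been laid out differently.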

Dave