[Linux-cluster] I/O to gfs2 hanging or not hanging after heartbeat loss

David Teigland teigland at redhat.com
Fri Apr 15 16:14:37 UTC 2016

> > However, on some occasions, I observe that node A continues in the loop
> > believing that it is successfully writing to the file

node A has the exclusive lock, so it continues writing...

> > but, according to
> > node C, the file stops being updated. (Meanwhile, the file written by
> > node B continues to be up-to-date as read by C.) This is concerning --
> > it looks like I/O writes are being completed on node A even though other
> > nodes in the cluster cannot see the results.

Is node C blocked trying to read the file A is writing?  That's what we'd
expect until recovery has removed node A.  Or are C's reads completing
while A continues writing the file?  That would not be correct.
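
One way to tell the two cases apart on node C is to attempt the read with a
time limit and see whether it returns. The following is only a minimal
sketch: the helper name try_read, the timeout value, and the example path are
hypothetical, and it assumes a plain read() of the file is enough to make C
request the lock.

```python
import threading


def try_read(path, timeout_s=5.0):
    """Attempt to read `path`; return (completed, data).

    completed is False if the read has not returned after timeout_s,
    which is what a reader blocked on the lock would look like.
    data is None if the read did not complete or raised an error.
    """
    result = {}

    def worker():
        try:
            with open(path, "rb") as f:
                result["data"] = f.read()
        except OSError:
            # Read finished with an error; leave "data" unset.
            pass

    # Daemon thread: a read stuck in the kernel cannot be cancelled,
    # so we just stop waiting for it instead of trying to kill it.
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_s)
    if t.is_alive():
        return False, None
    return True, result.get("data")


if __name__ == "__main__":
    # Hypothetical path to the file that node A is writing.
    completed, data = try_read("/mnt/gfs2/nodeA.out", timeout_s=5.0)
    print("read completed" if completed else "read blocked")
```

Running this on C while A is cut off shows which case applies: "read blocked"
matches the expected behaviour before recovery, while "read completed" with
stale data is the case that would not be correct.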

> However, if A happens to own the DLM lock, it does not need
> to ask DLM's permission because it owns the lock. Therefore, it goes
> on writing. Meanwhile, the other node can't get DLM's permission to
> get the lock back, so it hangs.

The description sounds like C might not be hanging in read as we'd expect
while A continues writing.  If that's the case, then it implies that dlm
recovery has been completed by nodes B and C (removing A), which allows
the lock to be granted to C for reading.  If dlm recovery on B/C has
completed, it means that A should have been fenced, so A should not be
able to write once C is given the lock.
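
The chain of implications above can be written down as a small decision
function. This is only a restatement of the reasoning in this message, not
dlm or gfs2 code; the Observation fields and the diagnose name are made up
for illustration.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    c_read_completes: bool  # C's read of A's file returns instead of blocking
    a_still_writing: bool   # A's writes to that file keep completing


def diagnose(obs):
    """Classify an observation using the reasoning in this thread."""
    if not obs.c_read_completes:
        # A still holds the exclusive lock and recovery has not removed
        # A yet, so C is expected to block.
        return "expected: C blocks until recovery removes A"
    # C got the lock, so dlm recovery on B/C must have completed,
    # which in turn means A should already have been fenced.
    if obs.a_still_writing:
        return "not correct: A should have been fenced before C got the lock"
    return "expected: recovery done, A fenced, lock granted to C"


if __name__ == "__main__":
    print(diagnose(Observation(c_read_completes=True, a_still_writing=True)))
```

The only combination flagged as a problem is C reading successfully while A
is still able to write.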


