[Linux-cluster] I/O to gfs2 hanging or not hanging after heartbeat loss
teigland at redhat.com
Fri Apr 15 16:14:37 UTC 2016
> > However, on some occasions, I observe that node A continues in the loop
> > believing that it is successfully writing to the file
node A has the exclusive lock, so it continues writing...
> > but, according to
> > node C, the file stops being updated. (Meanwhile, the file written by
> > node B continues to be up-to-date as read by C.) This is concerning --
> > it looks like I/O writes are being completed on node A even though other
> > nodes in the cluster cannot see the results.
Is node C blocked trying to read the file A is writing? That's what we'd
expect until recovery has removed node A. Or are C's reads completing
while A continues writing the file? That would not be correct.
> However, if A happens to own the DLM lock, it does not need
> to ask DLM's permission because it owns the lock. Therefore, it goes
> on writing. Meanwhile, the other node can't get DLM's permission to
> get the lock back, so it hangs.
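The behaviour quoted above can be modelled with a toy sketch. This is not
the real DLM API (libdlm is not shown here); it is just a thread-based
illustration, with an invented `LockManager` class, of why the holder of an
exclusive lock keeps writing without re-contacting the lock manager while
another node's acquire blocks:

```python
import threading
import time

class LockManager:
    """Toy stand-in for DLM: a single exclusive lock per resource."""
    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None

    def acquire(self, node):
        self._lock.acquire()      # blocks until the current owner releases
        self.owner = node

    def release(self, node):
        assert self.owner == node
        self.owner = None
        self._lock.release()

dlm = LockManager()
log = []

def node_a():
    dlm.acquire("A")
    for i in range(3):
        # A already owns the lock, so each write proceeds without asking
        # the lock manager again -- the "it goes on writing" case above.
        log.append(("A writes", i))
        time.sleep(0.01)
    dlm.release("A")

def node_c():
    dlm.acquire("C")              # hangs here until A releases the lock
    log.append(("C reads",))
    dlm.release("C")

ta = threading.Thread(target=node_a)
tc = threading.Thread(target=node_c)
ta.start()
time.sleep(0.005)                 # let A take the lock first
tc.start()
ta.join()
tc.join()
```

In this sketch C's read only appears in the log after all of A's writes,
mirroring the expected "C blocks until A is done (or removed)" ordering.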
The description sounds like C might not be hanging in read as we'd expect
while A continues writing. If that's the case, then it implies that dlm
recovery has been completed by nodes B and C (removing A), which allows
the lock to be granted to C for reading. If dlm recovery on B/C has
completed, it means that A should have been fenced, so A should not be
able to write once C is given the lock.
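That ordering constraint -- fencing must complete before dlm recovery hands
the failed node's locks to survivors -- can be sketched as a small model.
The `Cluster` class and its methods are invented for illustration; the real
cluster stack enforces this sequencing internally:

```python
class Cluster:
    """Toy model of the recovery ordering: a failed node must be
    fenced before its locks are granted to a surviving node."""
    def __init__(self, nodes, lock_owner):
        self.alive = set(nodes)
        self.fenced = set()
        self.lock_owner = lock_owner

    def fence(self, node):
        # power-off / storage cutoff: a fenced node can no longer write
        self.fenced.add(node)
        self.alive.discard(node)

    def recover(self, failed):
        # dlm recovery must not release the failed node's locks
        # until fencing has completed
        if failed not in self.fenced:
            raise RuntimeError(
                "recovery before fencing: %s could still be writing" % failed)
        self.lock_owner = None

    def grant(self, node):
        assert self.lock_owner is None
        self.lock_owner = node

c = Cluster(["A", "B", "C"], lock_owner="A")
c.fence("A")       # A is cut off first
c.recover("A")     # now it is safe to release A's locks
c.grant("C")       # C's read proceeds; A cannot still be writing
```

If `recover("A")` were attempted before `fence("A")`, the model raises,
which corresponds to the incorrect case described above: C reading while an
unfenced A continues writing.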