[Linux-cluster] Re: SMP and GFS

Mon Oct 3 16:51:51 UTC 2005

On Sun, Oct 02, 2005 at 12:23:05PM +0200, Axel Thimm wrote:
> On Thu, Jul 14, 2005 at 04:57:51PM -0400, Manuel Bujan wrote:
> > Jul 14 14:19:35 atmail-2 kernel: gfs001 (2023) req reply einval
> > d6c30333 fr 1 r 1        2
> > Jul 14 14:19:35 atmail-2 kernel: gfs001 send einval to 1
> > Jul 14 14:19:35 atmail-2 last message repeated 2 times

> I found similar log snippets on a RHEL4U1 machine with dual Xeons (HP
> Proliant). The machine crashed with a kernel panic shortly after
> telling the other nodes to leave the cluster (sorry, the staff was
> under pressure and no one wrote down the panic's output):
> 
> Sep 30 05:08:11 zs01 kernel: nval to 1 (P:kernel)
> Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel)
> Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel)
> Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel)
> Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel)

These "einval" messages from the dlm are not necessarily bad, and they are
not directly related to the "removing from cluster" messages below.  The
einval conditions above can legitimately occur during normal operation, and
the dlm should be able to deal with them.  Specifically, they mean that:

  1. node A is told that the lock master for resource R is node B
  2. the last lock is removed from R on B
  3. B gives up mastery of R
  4. A sends lock request to B
  5. B doesn't recognize R and returns einval to A
  6. A starts over

The message "send einval to..." is printed on B in step 5.
The message "req reply einval..." is printed on A in step 6.
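
For illustration, the six steps can be played out in a toy model.  This is
only a sketch and not the dlm's actual code: the Node class, the directory
dict, lookup_master() and request_lock() are all made-up names, and the
rule that the requester becomes the new master when nobody masters the
resource is an assumption of this model.

```python
import errno


class Node:
    """Toy cluster node (hypothetical; not the real dlm data structures)."""

    def __init__(self, name):
        self.name = name
        self.mastered = {}  # resource name -> number of locks held on it
        self.log = []       # messages this node would print

    def unlock(self, resource):
        # Steps 2-3: when the last lock is removed, give up mastery.
        self.mastered[resource] -= 1
        if self.mastered[resource] == 0:
            del self.mastered[resource]

    def handle_request(self, resource, sender):
        # Step 5: a resource this node does not recognize gets einval.
        if resource not in self.mastered:
            self.log.append(f"send einval to {sender.name}")
            return -errno.EINVAL
        self.mastered[resource] += 1
        return 0


def lookup_master(directory, resource, requester):
    """Assumed rule: if nobody masters the resource, the requester does."""
    master = directory.get(resource)
    if master is None or resource not in master.mastered:
        requester.mastered.setdefault(resource, 0)
        directory[resource] = requester
        master = requester
    return master


def request_lock(requester, resource, directory, cached_master, max_tries=3):
    """Steps 4-6 as seen from the requesting node."""
    master = cached_master
    for _ in range(max_tries):
        if master.handle_request(resource, requester) == 0:
            return master
        requester.log.append("req reply einval")
        # Step 6: start over with a fresh master lookup.
        master = lookup_master(directory, resource, requester)
    raise RuntimeError("no master found for " + resource)


a, b = Node("A"), Node("B")
b.mastered["R"] = 1                                  # B holds the last lock on R
directory = {"R": b}
stale = lookup_master(directory, "R", a)             # step 1
b.unlock("R")                                        # steps 2-3
new_master = request_lock(a, "R", directory, stale)  # steps 4-6
print(b.log, a.log, new_master.name)
```

In this model B logs one "send einval" and A logs one "req reply einval",
after which A's retry succeeds, which is why a single pair of these
messages is harmless.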

This is an unfortunate situation, but not lethal.  That said, a spike in
these messages may indicate that something is amiss (and that a "removing
from cluster" message may be on the way).  Or the gfs load may simply have
hit the dlm in a particularly sore spot.
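
One rough way to watch for such a spike is to count einval lines per host
and per minute in the syslog output.  The sketch below assumes the
timestamp and "kernel:" layout seen in the logs quoted in this thread; the
function names and the threshold of 3 are arbitrary choices for the
example, not recommended values.  It ignores "last message repeated N
times" lines, so the counts are a lower bound.

```python
import re
from collections import Counter

# Matches e.g. "Sep 30 05:08:11 zs01 kernel: data send einval to 1".
# Group 1 is the timestamp truncated to the minute, group 2 the host name.
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+(\S+)\s+kernel:.*einval"
)


def einval_per_minute(lines):
    """Count einval kernel messages per (host, minute)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            counts[(m.group(2), m.group(1))] += 1
    return counts


def spikes(counts, threshold):
    """Return the (host, minute) pairs with at least `threshold` messages."""
    return sorted(key for key, n in counts.items() if n >= threshold)


sample = [
    "Jul 14 14:19:35 atmail-2 kernel: gfs001 (2023) req reply einval d6c30333 fr 1 r 1 2",
    "Jul 14 14:19:35 atmail-2 kernel: gfs001 send einval to 1",
    "Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel)",
    "Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel)",
    "Sep 30 05:08:11 zs01 kernel: data send einval to 1 (P:kernel)",
    "Sep 30 05:08:11 zs01 kernel: Magma send einval to 1 (P:kernel)",
    "Sep 30 05:08:33 zs03 kernel: CMAN: removing node zs02 from the cluster :",
]

counts = einval_per_minute(sample)
for host, minute in spikes(counts, threshold=3):
    print(host, minute, counts[(host, minute)])
```

To run it against a real log, pass an open file instead of the sample
list, for example einval_per_minute(open("/var/log/messages")).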

> Sep 30 05:08:33 zs03 kernel: CMAN: removing node zs02 from the cluster :
> Missed too many heartbeats (P:kernel)
> Sep 30 05:08:39 zs03 kernel: CMAN: removing node zs01 from the cluster :
> No response to messages (P:kernel)

After a node is removed from the cluster like this, the dlm will often
return an error (like -EINVAL) to lock_dlm.  This is not the same thing as
the einval messages above.  Lock_dlm will always panic at that point, since
it can no longer acquire locks for gfs.

Dave