[Linux-cluster] 4U5 CSS/CMAN/fence quorum confusion

Robert Clark cluster at defuturo.co.uk
Mon Jun 11 15:46:32 UTC 2007


On Mon, 2007-06-11 at 12:00 +0100, Patrick Caulfield wrote:
> Robert Clark wrote:
> > On Mon, 2007-06-11 at 11:05 +0100, Patrick Caulfield wrote:
> >> Robert Clark wrote:

> >>>   Is the delay here likely to simply be udev being slow?

> >> It sounds like udev isn't creating it at all. What happens is that libdlm waits
> >> 10 seconds for udev to create the device file, and if it doesn't appear after
> >> that time it will do the job itself.

> >   As an experiment, I've tried just loading the dlm module on a node
> > with no cluster services running and confirmed that dlm-control is being
> > created by udev.
> > 
> >   I must admit - I'm pretty confused now about the role of libdlm. Since
> > it turns out that I've been running a 4U4 cluster without the dlm
> > package installed (and so no libdlm) and, until this morning, my 4U5
> > cluster in the same state, I'm wondering: What uses libdlm?

> I don't think anything does, and that might be the problem. But magma (which ccsd uses
> to talk to the cluster manager) checks for the existence of dlm-controld anyway,
> just in case you need to create any locks using magma, I suppose.

  OK, I may need to upgrade my condition from confused to baffled...

  I've slapped an strace on udevd during boot and here are some
excerpts:

Jun 11 16:23:48 localhost kernel: CMAN 2.6.9-50.2 (built May 31 2007 15:39:24) installed
Jun 11 16:23:48 localhost kernel: NET: Registered protocol family 30
Jun 11 16:23:48 localhost kernel: DLM 2.6.9-46.16 (built May 31 2007 15:45:51) installed

970   16:23:52 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={9, 0}}, NULL) = 0
970   16:23:52 select(6, [3 5], NULL, NULL, NULL) = ? ERESTARTNOHAND (To be restarted)
970   16:24:01 --- SIGALRM (Alarm clock) @ 0 (0) ---

970   16:24:01 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf6fc9708) = 3090
3090  16:24:01 execve("/sbin/udev", ["udev", "misc"], [/* 3 vars */])   = 0

3090  16:24:01 mknod("/dev/misc/dlm-control", S_IFCHR|0666, makedev(10, 62))   = 0

  So udev is certainly the main source of the delay (it appears to wait
for a SIGALRM before it handles the misc event?), but then the same
version of udev is on both clusters.
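
  To put a number on that wait, the gap between the setitimer call and
the SIGALRM can be worked out from the strace timestamps. A throwaway
sketch only; the helper name seconds_between is made up for
illustration:

```python
from datetime import datetime

def seconds_between(start, end):
    """Return whole seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())

# Timestamps copied from the strace excerpt above.
armed = "16:23:52"   # setitimer(ITIMER_REAL, ... it_value={9, 0})
fired = "16:24:01"   # SIGALRM delivered, then udev is forked for "misc"
gap = seconds_between(armed, fired)
print(gap)
```

  The gap matches the it_value={9, 0} passed to setitimer, so udevd
seems to have sat out the whole timer before creating the node.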

  I guess I'll add something to the startup script to wait
for /dev/misc/dlm-control to exist before starting fenced.
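
  Something along these lines might do for the wait. It is a rough
sketch, not anything shipped with the cluster packages: the function
name, the 30 second timeout and the poll interval are all my own
assumptions, and in a real init script the same loop would more likely
be written in shell.

```python
import os
import time

def wait_for_path(path, timeout=30.0, interval=0.5):
    """Poll until `path` exists.

    Returns True as soon as the path is seen, or False if `timeout`
    seconds pass without it appearing.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Hypothetical use before starting fenced (path taken from the strace
# output; the timeout value is a guess):
#   if not wait_for_path("/dev/misc/dlm-control", timeout=30.0):
#       raise SystemExit("dlm-control did not appear, not starting fenced")
```

  Polling with a deadline means the script carries on as soon as udev
creates the node, and gives up cleanly if it never turns up.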

	Thanks,

		Robert



