[Linux-cluster] Possible problem with different architectures

Gareth Bult Gareth at Bult.co.uk
Sun Jul 4 00:09:42 UTC 2004


Hi,

With help from the guys on #linux-cluster ( thanks guys :) ) I've
managed to get a 3-node cluster running.

Two of the nodes are x86 and the third is an amd64 - all are running
identical Gentoo installs on kernel 2.6.7.
All are running an up-to-date CVS checkout of /cluster.

I can successfully export a device from one x86 box to another, then
format and mount a GFS filesystem on it from both x86 boxes. This works
great.

However, I can't run gnbd_import on the amd64 box. I get:

gnbd_import: /dev/gnbd/netdisc is not in use. deleting
gnbd_import: created gnbd device netdisc2
gnbd_monitor: gnbd_monitor started. Monitoring device #0
<gnbd_import does not return, Ctrl-C at this point>
gnbd_import: ERROR gnbd_recvd failed

It looks as though gnbd_recvd is failing to complete a handshake, i.e.
hanging halfway through. Any suggestions are welcome.
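
One way a handshake between a 32-bit and a 64-bit peer can stall like
this is a header laid out with native C types whose size differs between
the two architectures: the receiver then waits for bytes the sender never
writes. This is offered purely as a guess, not as a diagnosis of gnbd.
The Python sketch below uses a made-up header (the field names and layout
are hypothetical, not gnbd's wire format) to compare a native layout with
a fixed-width one.

```python
import struct

# Hypothetical handshake header: a magic number, a command code and a length.
# This is NOT gnbd's real wire format; it only illustrates the size issue.
NATIVE_FMT = "@IIL"  # native sizes/alignment; 'L' is C unsigned long
                     # (4 bytes on x86 Linux, 8 bytes on amd64 Linux)
FIXED_FMT = "!IIQ"   # network byte order, standard sizes: 4 + 4 + 8 bytes


def header_sizes():
    """Return (native_size, fixed_size) in bytes for the two layouts."""
    return struct.calcsize(NATIVE_FMT), struct.calcsize(FIXED_FMT)


def pack_fixed(magic, command, length):
    """Pack the header with fixed-width fields so both peers agree on its size."""
    return struct.pack(FIXED_FMT, magic, command, length)


def unpack_fixed(data):
    """Unpack a header produced by pack_fixed."""
    return struct.unpack(FIXED_FMT, data)


if __name__ == "__main__":
    native, fixed = header_sizes()
    print("native header size on this machine:", native)
    print("fixed-width header size everywhere:", fixed)
```

Running this on each node would show whether the native size differs
between the x86 and amd64 boxes; the fixed-width form is the same size on
both.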

On another note, I've had a number of kernel crashes and, looking at the
logs, I'm wondering whether it's because I'm running a preemptible
kernel?

Here are two sample crash dumps from syslog. Typically the processes
involved go into D state and the machine won't shut down cleanly.

Crash #1 (x86 box):

Jul  3 22:44:48 rag CMAN: node squizzey.linux.co.uk is not responding -
removing from the cluster
Jul  3 22:44:53 rag dlm: clvmd: recover event 2 (first)
Jul  3 22:44:53 rag dlm: clvmd: add nodes
Jul  3 22:44:53 rag Unable to handle kernel paging request at virtual
address 0c000000
Jul  3 22:44:53 rag printing eip:
Jul  3 22:44:53 rag c013c2cb
Jul  3 22:44:53 rag *pde = 00000000
Jul  3 22:44:53 rag Oops: 0000 [#1]
Jul  3 22:44:53 rag PREEMPT
Jul  3 22:44:53 rag Modules linked in: gnbd gfs lock_dlm dlm cman
lock_harness ohci_hcd e100 mii snd_intel8x0 snd_ac97_codec snd_pcm
snd_timer snd_page_alloc gameport snd_mpu401_uart snd_rawmidi
snd_seq_device snd uhci_hcd intel_agp agpgart st usb_storage scsi_mod
ehci_hcd usbcore
Jul  3 22:44:53 rag CPU:    0
Jul  3 22:44:53 rag EIP:    0060:[<c013c2cb>]    Not tainted
Jul  3 22:44:53 rag EFLAGS: 00010292   (2.6.7)
Jul  3 22:44:53 rag EIP is at page_address+0xb/0xb0
Jul  3 22:44:53 rag eax: 0c000000   ebx: 0c000000   ecx: 00000000   edx:
18e0e600
Jul  3 22:44:53 rag esi: 18e0e600   edi: e0e600b8   ebp: e0e600e8   esp:
e0e15e1c
Jul  3 22:44:53 rag ds: 007b   es: 007b   ss: 0068
Jul  3 22:44:53 rag Process dlm_recoverd (pid: 9579, threadinfo=e0e14000
task=e6542eb0)
Jul  3 22:44:53 rag Stack: 00000000 e0e60001 18e0e600 e0e600b8 e0e600e8
e85baee1 0c000000 e85c84b7
Jul  3 22:44:53 rag 18e0e600 18000000 00000018 e0e15ee0 00000002
00000002 e85bb3cf 00000002
Jul  3 22:44:53 rag 00000018 000000d0 e0e15e6c 00000000 00000000
00000018 e0e15ee0 00000002
Jul  3 22:44:53 rag Call Trace:
Jul  3 22:44:53 rag [<e85baee1>] lowcomms_get_buffer+0x81/0x150 [dlm]
Jul  3 22:44:53 rag [<e85bb3cf>] lowcomms_send_message+0x3f/0xf0 [dlm]
Jul  3 22:44:53 rag [<e85bccf4>] midcomms_send_message+0x44/0x70 [dlm]
Jul  3 22:44:53 rag [<e85c1621>] rcom_send_message+0xd1/0x210 [dlm]
Jul  3 22:44:53 rag [<e85c23f0>] gdlm_wait_status_low+0x60/0x90 [dlm]
Jul  3 22:44:53 rag [<e85bd07a>] nodes_reconfig_wait+0x2a/0x80 [dlm]
Jul  3 22:44:53 rag [<e85bd57f>] ls_nodes_init+0xbf/0x150 [dlm]
Jul  3 22:44:53 rag [<e85c31d2>] ls_first_start+0x62/0x160 [dlm]
Jul  3 22:44:53 rag [<e85c420d>] do_ls_recovery+0x1ed/0x430 [dlm]
Jul  3 22:44:53 rag [<e85c4593>] dlm_recoverd+0x143/0x180 [dlm]
Jul  3 22:44:53 rag [<c0114620>] default_wake_function+0x0/0x20
Jul  3 22:44:53 rag [<c0105c72>] ret_from_fork+0x6/0x14
Jul  3 22:44:53 rag [<c0114620>] default_wake_function+0x0/0x20
Jul  3 22:44:53 rag [<e85c4450>] dlm_recoverd+0x0/0x180 [dlm]
Jul  3 22:44:53 rag [<c0103f4d>] kernel_thread_helper+0x5/0x18
Jul  3 22:44:53 rag
Jul  3 22:44:53 rag Code: 8b 03 f6 c4 01 75 1e 8b 2d 8c 63 48 c0 29 eb
c1 fb 05 c1 e3
Jul  3 22:44:53 rag ccsd[9560]: Error while processing get: No data
available

Crash #2 (amd64 box):

Jul  3 21:42:28 squizzey dlm: clvmd: recover event 2 (first)
Jul  3 21:42:28 squizzey dlm: clvmd: add nodes
Jul  3 21:42:28 squizzey Unable to handle kernel NULL pointer
dereference at 000000000000008a RIP:
Jul  3 21:42:28 squizzey <ffffffffa06b5dc6>{:dlm:send_to_sock+54}
Jul  3 21:42:28 squizzey PML4 3f7a9067 PGD b591067 PMD 0
Jul  3 21:42:28 squizzey Oops: 0000 [1] PREEMPT
Jul  3 21:42:28 squizzey CPU 0
Jul  3 21:42:28 squizzey Modules linked in: gnbd lock_dlm dlm cman gfs
lock_harness dm_mod ipt_ttl ipt_limit ipt_state iptable_filter
iptable_mangle ipt_LOG ipt_MASQUERADE ipt_TOS ipt_REDIRECT iptable_nat
ipt_REJECT ip_tables ip_conntrack_irc ip_conntrack_ftp ip_conntrack
nvidia usblp usbhid forcedeth ohci_hcd snd_intel8x0 snd_ac97_codec
snd_mpu401_uart snd_rawmidi snd_seq_oss snd_seq_midi_event snd_seq
snd_seq_device snd_pcm_oss snd_pcm snd_page_alloc snd_timer
snd_mixer_oss snd usb_storage ehci_hcd usbcore
Jul  3 21:42:28 squizzey Pid: 31748, comm: dlm_sendd Tainted: P   2.6.7
Jul  3 21:42:28 squizzey RIP: 0010:[<ffffffffa06b5dc6>]
<ffffffffa06b5dc6>{:dlm:send_to_sock+54}
Jul  3 21:42:28 squizzey RSP: 0018:00000100319b5ec8  EFLAGS: 00010202
Jul  3 21:42:28 squizzey RAX: 0000000000000002 RBX: ffffffffa06ca0f0
RCX: 00000100139c80c0
Jul  3 21:42:28 squizzey RDX: 0000000000000000 RSI: 00000000ffffffff
RDI: 00000100139c80b8
Jul  3 21:42:28 squizzey RBP: 00000100139c80a8 R08: 00000100319b4000
R09: 0000000000000000
Jul  3 21:42:28 squizzey R10: 00000000ffffffff R11: 0000000000000000
R12: 0000010030d1d150
Jul  3 21:42:28 squizzey R13: 00000100139c80a8 R14: 0000000000000000
R15: 000000358cc16f78
Jul  3 21:42:28 squizzey FS:  000000358d80f640(0000) GS:ffffffff804f61c0
(0000) knlGS:0000000000000000
Jul  3 21:42:28 squizzey CS:  0010 DS: 0000 ES: 0000 CR0:
000000008005003b
Jul  3 21:42:28 squizzey CR2: 000000000000008a CR3: 0000000000101000
CR4: 00000000000006e0
Jul  3 21:42:28 squizzey Process dlm_sendd (pid: 31748, threadinfo
00000100319b4000, task 000001000676a000)
Jul  3 21:42:28 squizzey Stack: 0000007a319b5f08 00000100139c80b8
0000000000000a64 ffffffffa06ca0f0
Jul  3 21:42:28 squizzey 00000100139c80a8 0000010030d1d150
0000000000000005 00000100297df89c
Jul  3 21:42:28 squizzey 000000358cc16f78 ffffffffa06b637d
Jul  3 21:42:28 squizzey Call Trace:<ffffffffa06b637d>
{:dlm:process_output_queue+157} <ffffffffa06b68b8>{:dlm:dlm_sendd+184}
Jul  3 21:42:28 squizzey <ffffffff8011126f>{child_rip+8}
<ffffffffa06b6800>{:dlm:dlm_sendd+0}
Jul  3 21:42:28 squizzey <ffffffff80111267>{child_rip+0}
Jul  3 21:42:28 squizzey
Jul  3 21:42:28 squizzey Code: 48 8b 80 88 00 00 00 48 89 44 24 10 65 48
8b 04 25 18 00 00
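
To compare the two dumps it can help to reduce each one to its list of
function names. The sketch below is a small Python helper for that. The
name trace_symbols and both regular expressions are illustrative
assumptions written against the two trace formats shown above; this is
not a general oops parser.

```python
import re

# i386 style, e.g.:   [<e85baee1>] lowcomms_get_buffer+0x81/0x150 [dlm]
I386_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")
# x86_64 style, e.g.: {:dlm:send_to_sock+54} or {child_rip+8}
X86_64_RE = re.compile(r"\{(?::\w+:)?(\w+)\+\d+\}")


def trace_symbols(log_text):
    """Return function names in the order they appear in an oops dump."""
    hits = []
    for regex in (I386_RE, X86_64_RE):
        for match in regex.finditer(log_text):
            # Remember the position so results from both patterns can be
            # merged back into document order.
            hits.append((match.start(), match.group(1)))
    return [name for _, name in sorted(hits)]


if __name__ == "__main__":
    import sys
    # Usage: python trace_symbols.py < /var/log/messages
    for name in trace_symbols(sys.stdin.read()):
        print(name)
```

Because the x86_64 pattern keys on the braces alone, it still matches
entries where the mail client wrapped the line between the address and
the symbol.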

-- 
Gareth Bult <Gareth at Bult.co.uk>