From Alain.Moulle at bull.net Mon Mar 2 14:12:48 2009 From: Alain.Moulle at bull.net (Alain.Moulle) Date: Mon, 02 Mar 2009 15:12:48 +0100 Subject: [Linux-cluster] CS5 & qdisk : details about problem "Node is undead" (contd.) Message-ID: <49ABE960.6040205@bull.net> Hi, I have a more precise sequence in the syslog on the node entering the infernal loop "Node is undead", I think it could give some more indication about the problem, if someone could help me ... Thanks a lot Regards Alain [root at node3 ~]# tail -f /var/log/syslog /var/log/daemons/*& [1] 12017 [root at node3 ~]# ==> /var/log/syslog <== Mar 2 14:07:29 s_sys at node3 qdiskd[11100]: Initial score 1/1 Mar 2 14:07:29 s_sys at node3 qdiskd[11100]: Initialization complete Mar 2 14:07:29 s_sys at node3 openais[11209]: [CMAN ] quorum device registered Mar 2 14:07:29 s_sys at node3 qdiskd[11100]: Score sufficient for master operation (1/1; required=1); upgrading Mar 2 14:07:38 s_sys at node3 qdiskd[11100]: Node 1 is the master *===> Here is the poweroff -nf on node2* Mar 2 14:07:59 s_sys at node3 qdiskd[11100]: Node 1 missed an update (2/16) Mar 2 14:08:00 s_sys at node3 qdiskd[11100]: Node 1 missed an update (3/16) Mar 2 14:08:01 s_sys at node3 qdiskd[11100]: Node 1 missed an update (4/16) Mar 2 14:08:02 s_sys at node3 qdiskd[11100]: Node 1 missed an update (5/16) Mar 2 14:08:03 s_sys at node3 qdiskd[11100]: Node 1 missed an update (6/16) ==> /var/log/daemons/errors <== Mar 2 13:18:24 s_sys at node3 ccsd[15202]: Error while processing connect: Connection refused Mar 2 13:18:24 s_sys at node3 ccsd[15202]: Error while processing connect: Connection refused Mar 2 13:18:25 s_sys at node3 ccsd[15202]: Error while processing connect: Connection refused Mar 2 13:18:25 s_sys at node3 ccsd[15202]: Error while processing connect: Connection refused Mar 2 13:46:31 s_sys at node3 mountd[16139]: Caught signal 15, un-registering and exiting. Mar 2 13:46:44 s_sys at node3 dlm_controld[15234]: cluster is down, exiting Mar 2 13:46:44 s_sys at node3 gfs_controld[15240]: cluster is down, exiting Mar 2 13:46:44 s_sys at node3 fenced[15228]: cluster is down, exiting Mar 2 13:46:44 s_sys at node3 qdiskd[15156]: cman_dispatch: Host is down Mar 2 13:46:44 s_sys at node3 qdiskd[15156]: Halting qdisk operations ==> /var/log/daemons/info <== Mar 2 14:06:58 s_sys at node3 openais[11209]: [MAIN ] Using default multicast address of 239.192.0.0 Mar 2 14:06:58 s_sys at node3 openais[11209]: [CMAN ] CMAN 2.0.98 (built Jan 27 2009 10:53:06) started Mar 2 14:06:59 s_sys at node3 openais[11209]: [CMAN ] quorum regained, resuming activity Mar 2 14:07:12 s_sys at node3 qdiskd[11100]: Quorum Partition: /dev/disk/by-id/scsi-3600601607b801c00343ab689eb03de11 Label: QDISK_0_0 Mar 2 14:07:12 s_sys at node3 qdiskd[11100]: Quorum Daemon Initializing Mar 2 14:07:13 s_sys at node3 qdiskd[11100]: Heuristic: 'ping -W1 -c1 -t3 12.11.2.1; RES=$?; if [ $RES -ne 0 ]; then ping -W1 -c1 -t3 12.11.2.1; RES=$?; if [ $RES -ne 0 ]; then ping -W3 -c1 -t3 12.11.2.1; RES=$?; fi; fi; echo $RES | grep -w 0 ' UP Mar 2 14:07:29 s_sys at node3 qdiskd[11100]: Initial score 1/1 Mar 2 14:07:29 s_sys at node3 qdiskd[11100]: Initialization complete Mar 2 14:07:29 s_sys at node3 openais[11209]: [CMAN ] quorum device registered Mar 2 14:07:38 s_sys at node3 qdiskd[11100]: Node 1 is the master ==> /var/log/daemons/warnings <== Mar 2 10:25:54 s_sys at node3 clurgmgrd: [5635]: Link for eth0: Not detected Mar 2 10:25:54 s_sys at node3 clurgmgrd: [5635]: No link on eth0... 
Mar 2 10:26:14 s_sys at node3 clurgmgrd: [5635]: Link for eth0: Not detected Mar 2 10:26:14 s_sys at node3 clurgmgrd: [5635]: No link on eth0... Mar 2 10:26:24 s_sys at node3 clurgmgrd: [5635]: Link for eth0: Not detected Mar 2 10:26:24 s_sys at node3 clurgmgrd: [5635]: No link on eth0... Mar 2 10:30:31 s_sys at node3 avahi-daemon[25488]: WARNING: No NSS support for mDNS detected, consider installing nss-mdns! Mar 2 11:11:15 s_sys at node3 avahi-daemon[25474]: WARNING: No NSS support for mDNS detected, consider installing nss-mdns! Mar 2 11:28:24 s_sys at node3 avahi-daemon[25466]: WARNING: No NSS support for mDNS detected, consider installing nss-mdns! Mar 2 13:56:47 s_sys at node3 avahi-daemon[25408]: WARNING: No NSS support for mDNS detected, consider installing nss-mdns! [root at node3 ~]# ==> /var/log/syslog <== Mar 2 14:08:04 s_sys at node3 qdiskd[11100]: Node 1 missed an update (7/16) Mar 2 14:08:05 s_sys at node3 qdiskd[11100]: Node 1 missed an update (8/16) Mar 2 14:08:06 s_sys at node3 qdiskd[11100]: Node 1 missed an update (9/16) Mar 2 14:08:07 s_sys at node3 qdiskd[11100]: Node 1 missed an update (10/16) Mar 2 14:08:08 s_sys at node3 qdiskd[11100]: Node 1 missed an update (11/16) Mar 2 14:08:09 s_sys at node3 qdiskd[11100]: Node 1 missed an update (12/16) Mar 2 14:08:10 s_sys at node3 qdiskd[11100]: Node 1 missed an update (13/16) Mar 2 14:08:11 s_sys at node3 qdiskd[11100]: Node 1 missed an update (14/16) Mar 2 14:08:12 s_sys at node3 qdiskd[11100]: Node 1 missed an update (15/16) Mar 2 14:08:13 s_sys at node3 qdiskd[11100]: Node 1 missed an update (16/16) Mar 2 14:08:14 s_sys at node3 qdiskd[11100]: Node 1 missed an update (17/16) Mar 2 14:08:14 s_sys at node3 qdiskd[11100]: Node 1 DOWN Mar 2 14:08:14 s_sys at node3 qdiskd[11100]: Making bid for master Mar 2 14:08:15 s_sys at node3 qdiskd[11100]: Node 1 missed an update (18/16) Mar 2 14:08:16 s_sys at node3 qdiskd[11100]: Node 1 missed an update (19/16) Mar 2 14:08:17 s_sys at node3 qdiskd[11100]: Node 1 missed an update (20/16) Mar 2 14:08:18 s_sys at node3 qdiskd[11100]: Node 1 missed an update (21/16) Mar 2 14:08:19 s_sys at node3 qdiskd[11100]: Node 1 missed an update (22/16) Mar 2 14:08:20 s_sys at node3 qdiskd[11100]: Node 1 missed an update (23/16) Mar 2 14:08:21 s_sys at node3 qdiskd[11100]: Node 1 missed an update (24/16) Mar 2 14:08:21 s_sys at node3 qdiskd[11100]: Assuming master role ==> /var/log/daemons/info <== Mar 2 14:08:21 s_sys at node3 qdiskd[11100]: Assuming master role ==> /var/log/syslog <== Mar 2 14:08:22 s_sys at node3 qdiskd[11100]: Node 1 missed an update (25/16) Mar 2 14:08:22 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:22 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:23 s_sys at node3 qdiskd[11100]: Node 1 evicted Mar 2 14:08:24 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:24 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:24 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:25 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:25 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:25 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:26 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:26 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:26 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:27 s_sys at node3 qdiskd[11100]: Node 1 is undead. 
Mar 2 14:08:27 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:27 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:28 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:28 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:28 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:29 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:29 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:29 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:29 s_sys at node3 openais[11209]: [TOTEM] The token was lost in the OPERATIONAL state. Mar 2 14:08:29 s_sys at node3 openais[11209]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes). Mar 2 14:08:29 s_sys at node3 openais[11209]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). Mar 2 14:08:29 s_sys at node3 openais[11209]: [TOTEM] entering GATHER state from 2. Mar 2 14:08:30 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:30 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:30 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:31 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:31 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:31 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:32 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:32 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:32 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:33 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:33 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:33 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:34 s_sys at node3 openais[11209]: [TOTEM] entering GATHER state from 0. Mar 2 14:08:34 s_sys at node3 openais[11209]: [TOTEM] Creating commit token because I am the rep. Mar 2 14:08:34 s_sys at node3 openais[11209]: [TOTEM] Saving state aru 49 high seq received 49 Mar 2 14:08:34 s_sys at node3 openais[11209]: [TOTEM] Storing new sequence id for ring 27c Mar 2 14:08:34 s_sys at node3 openais[11209]: [TOTEM] entering COMMIT state. Mar 2 14:08:34 s_sys at node3 openais[11209]: [TOTEM] entering RECOVERY state. Mar 2 14:08:34 s_sys at node3 openais[11209]: [TOTEM] position [0] member 12.11.2.4: Mar 2 14:08:34 s_sys at node3 openais[11209]: [TOTEM] previous ring seq 632 rep 12.11.2.3 Mar 2 14:08:34 s_sys at node3 openais[11209]: [TOTEM] aru 49 high delivered 49 received flag 1 Mar 2 14:08:34 s_sys at node3 openais[11209]: [TOTEM] Did not need to originate any messages in recovery. 
Mar 2 14:08:34 s_sys at node3 openais[11209]: [TOTEM] Sending initial ORF token Mar 2 14:08:34 s_sys at node3 openais[11209]: [CLM ] CLM CONFIGURATION CHANGE Mar 2 14:08:34 s_sys at node3 openais[11209]: [CLM ] New Configuration: Mar 2 14:08:34 s_sys at node3 openais[11209]: [CLM ] r(0) ip(12.11.2.4) Mar 2 14:08:34 s_sys at node3 openais[11209]: [CLM ] Members Left: Mar 2 14:08:34 s_sys at node3 openais[11209]: [CLM ] r(0) ip(12.11.2.3) Mar 2 14:08:34 s_sys at node3 openais[11209]: [CLM ] Members Joined: Mar 2 14:08:34 s_sys at node3 openais[11209]: [cpg.c:0641] confchg, low nodeid=2, us = 2 Mar 2 14:08:34 s_sys at node3 openais[11209]: [cpg.c:0651] confchg, build downlist: 1 nodes Mar 2 14:08:34 s_sys at node3 openais[11209]: [CLM ] CLM CONFIGURATION CHANGE Mar 2 14:08:34 s_sys at node3 openais[11209]: [CLM ] New Configuration: Mar 2 14:08:34 s_sys at node3 openais[11209]: [CLM ] r(0) ip(12.11.2.4) Mar 2 14:08:34 s_sys at node3 openais[11209]: [CLM ] Members Left: Mar 2 14:08:34 s_sys at node3 openais[11209]: [CLM ] Members Joined: Mar 2 14:08:34 s_sys at node3 openais[11209]: [cpg.c:0662] confchg, sent downlist Mar 2 14:08:34 s_sys at node3 openais[11209]: [SYNC ] This node is within the primary component and will provide service. Mar 2 14:08:34 s_sys at node3 openais[11209]: [TOTEM] entering OPERATIONAL state. Mar 2 14:08:34 s_sys at node3 openais[11209]: [cpg.c:0785] downlist left_list: 1 Mar 2 14:08:34 s_sys at node3 openais[11209]: [cpg.c:0393] Sending new joinlist (1 elements) to clients Mar 2 14:08:34 s_sys at node3 openais[11209]: [cpg.c:0393] Sending new joinlist (1 elements) to clients Mar 2 14:08:34 s_sys at node3 openais[11209]: [cpg.c:0393] Sending new joinlist (1 elements) to clients Mar 2 14:08:34 s_sys at node3 openais[11209]: [cpg.c:0393] Sending new joinlist (1 elements) to clients Mar 2 14:08:34 s_sys at node3 openais[11209]: [CLM ] got nodejoin message 12.11.2.4 Mar 2 14:08:34 s_kernel at node3 kernel: dlm: closing connection to node 1 Mar 2 14:08:34 s_sys at node3 openais[11209]: [cpg.c:0959] sending joinlist to cluster Mar 2 14:08:34 s_sys at node3 openais[11209]: [CPG ] got joinlist message from node 2 Mar 2 14:08:34 s_sys at node3 openais[11209]: [cpg.c:1114] got mcast request on 0x128a57e0 Mar 2 14:08:34 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:34 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:34 s_sys at node3 openais[11209]: [cpg.c:1114] got mcast request on 0x128a57e0 Mar 2 14:08:34 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:34 s_sys at node3 fenced[11229]: node2 not a cluster member after 0 sec post_fail_delay Mar 2 14:08:34 s_sys at node3 fenced[11229]: fencing node "node2" ==> /var/log/daemons/info <== Mar 2 14:08:34 s_sys at node3 fenced[11229]: node2 not a cluster member after 0 sec post_fail_delay Mar 2 14:08:34 s_sys at node3 fenced[11229]: fencing node "node2" ==> /var/log/syslog <== Mar 2 14:08:35 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:35 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:35 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:36 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:36 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:36 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:37 s_sys at node3 qdiskd[11100]: Node 1 is undead. 
Mar 2 14:08:37 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:37 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:38 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:38 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:38 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node [root at node3 ~]# Mar 2 14:08:39 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:39 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:39 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:39 s_sys at node3 fenced[11229]: fence "node2" success Mar 2 14:08:39 s_sys at node3 openais[11209]: [cpg.c:1114] got mcast request on 0x128a57e0 Mar 2 14:08:39 s_sys at node3 openais[11209]: [cpg.c:1114] got mcast request on 0x128a57e0 ==> /var/log/daemons/info <== Mar 2 14:08:39 s_sys at node3 fenced[11229]: fence "node2" success fg ==> /var/log/syslog <== Mar 2 14:08:40 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:40 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:40 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node tail -f /var/log/syslog /var/log/daemons/* [root at node3 ~]# tail -f /var/log/syslog /var/log/daemons/*& [1] 12044 [root at node3 ~]# ==> /var/log/syslog <== Mar 2 14:08:41 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:42 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:42 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:42 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:43 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:43 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:43 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:44 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:44 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:44 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node ==> /var/log/daemons/errors <== Mar 2 13:18:24 s_sys at node3 ccsd[15202]: Error while processing connect: Connection refused Mar 2 13:18:24 s_sys at node3 ccsd[15202]: Error while processing connect: Connection refused Mar 2 13:18:25 s_sys at node3 ccsd[15202]: Error while processing connect: Connection refused Mar 2 13:18:25 s_sys at node3 ccsd[15202]: Error while processing connect: Connection refused Mar 2 13:46:31 s_sys at node3 mountd[16139]: Caught signal 15, un-registering and exiting. 
Mar 2 13:46:44 s_sys at node3 dlm_controld[15234]: cluster is down, exiting Mar 2 13:46:44 s_sys at node3 gfs_controld[15240]: cluster is down, exiting Mar 2 13:46:44 s_sys at node3 fenced[15228]: cluster is down, exiting Mar 2 13:46:44 s_sys at node3 qdiskd[15156]: cman_dispatch: Host is down Mar 2 13:46:44 s_sys at node3 qdiskd[15156]: Halting qdisk operations ==> /var/log/daemons/info <== Mar 2 14:07:12 s_sys at node3 qdiskd[11100]: Quorum Daemon Initializing Mar 2 14:07:13 s_sys at node3 qdiskd[11100]: Heuristic: 'ping -W1 -c1 -t3 12.11.2.1; RES=$?; if [ $RES -ne 0 ]; then ping -W1 -c1 -t3 12.11.2.1; RES=$?; if [ $RES -ne 0 ]; then ping -W3 -c1 -t3 12.11.2.1; RES=$?; fi; fi; echo $RES | grep -w 0 ' UP Mar 2 14:07:29 s_sys at node3 qdiskd[11100]: Initial score 1/1 Mar 2 14:07:29 s_sys at node3 qdiskd[11100]: Initialization complete Mar 2 14:07:29 s_sys at node3 openais[11209]: [CMAN ] quorum device registered Mar 2 14:07:38 s_sys at node3 qdiskd[11100]: Node 1 is the master Mar 2 14:08:21 s_sys at node3 qdiskd[11100]: Assuming master role Mar 2 14:08:34 s_sys at node3 fenced[11229]: node2 not a cluster member after 0 sec post_fail_delay Mar 2 14:08:34 s_sys at node3 fenced[11229]: fencing node "node2" Mar 2 14:08:39 s_sys at node3 fenced[11229]: fence "node2" success ==> /var/log/daemons/warnings <== Mar 2 10:25:54 s_sys at node3 clurgmgrd: [5635]: Link for eth0: Not detected Mar 2 10:25:54 s_sys at node3 clurgmgrd: [5635]: No link on eth0... Mar 2 10:26:14 s_sys at node3 clurgmgrd: [5635]: Link for eth0: Not detected Mar 2 10:26:14 s_sys at node3 clurgmgrd: [5635]: No link on eth0... Mar 2 10:26:24 s_sys at node3 clurgmgrd: [5635]: Link for eth0: Not detected Mar 2 10:26:24 s_sys at node3 clurgmgrd: [5635]: No link on eth0... Mar 2 10:30:31 s_sys at node3 avahi-daemon[25488]: WARNING: No NSS support for mDNS detected, consider installing nss-mdns! Mar 2 11:11:15 s_sys at node3 avahi-daemon[25474]: WARNING: No NSS support for mDNS detected, consider installing nss-mdns! Mar 2 11:28:24 s_sys at node3 avahi-daemon[25466]: WARNING: No NSS support for mDNS detected, consider installing nss-mdns! Mar 2 13:56:47 s_sys at node3 avahi-daemon[25408]: WARNING: No NSS support for mDNS detected, consider installing nss-mdns! [root at node3 ~]# [root at node3 ~]# [root at node3 ~]# [root at node3 ~]# ==> /var/log/syslog <== Mar 2 14:08:45 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:45 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:45 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node tail -f /var/log/syslog /var/log/daemons/Mar 2 14:08:46 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:46 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:46 s_sys at node3 qdiskd[11100]: Telling CMAN Mar 2 14:08:47 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:47 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:47 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:48 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:48 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:48 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:49 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:49 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:49 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:50 s_sys at node3 qdiskd[11100]: Node 1 is undead. 
Mar 2 14:08:50 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:50 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:51 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:51 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:51 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:52 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:52 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:52 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:53 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:53 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:53 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:54 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:54 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:54 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:55 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:55 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:55 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:56 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:56 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:56 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:57 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:57 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:57 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:58 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:58 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:58 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:08:59 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:08:59 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:08:59 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:09:00 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:09:00 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:09:00 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:09:01 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:09:01 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:09:01 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:09:02 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:09:02 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 Mar 2 14:09:02 s_sys at node3 qdiskd[11100]: Telling CMAN to kill the node Mar 2 14:09:03 s_sys at node3 qdiskd[11100]: Node 1 is undead. Mar 2 14:09:03 s_sys at node3 qdiskd[11100]: Writing eviction notice for node 1 M -------------- next part -------------- An HTML attachment was scrubbed... URL: From fdinitto at redhat.com Tue Mar 3 06:49:02 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Tue, 03 Mar 2009 07:49:02 +0100 Subject: [Linux-cluster] Cluster 3.0.0.alpha6 released Message-ID: <1236062942.21723.17.camel@cerberus.int.fabbione.net> The cluster team and its community are proud to announce the 3.0.0.alpha6 release from the STABLE3 branch. The development cycle for 3.0.0 is about to end. 
The STABLE3 branch is now collecting only bug fixes and the minimal updates required to build on top of the latest upstream kernel/corosync/openais; we are getting closer and closer to a shiny new stable release.

Everybody with test equipment and time to spare is highly encouraged to download, install and test the 3.0.0.alpha releases and, more importantly, to report problems. This is the time for people to make a difference and help us test as much as possible.

In order to build the 3.0.0.alpha6 release you will need:

- corosync 0.94
- openais 0.93
- linux kernel 2.6.28 (for the gfs1 kernel module)

The new source tarball can be downloaded here:

ftp://sources.redhat.com/pub/cluster/releases/cluster-3.0.0.alpha6.tar.gz
https://fedorahosted.org/releases/c/l/cluster/cluster-3.0.0.alpha6.tar.gz

At the same location it is now possible to find separate tarballs for fence-agents and resource-agents, as previously announced
(http://www.redhat.com/archives/cluster-devel/2009-February/msg00003.html)

To report bugs or issues:

https://bugzilla.redhat.com/

Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadmins or power users.

Happy clustering,
Fabio

Under the hood (from 3.0.0.alpha5):

Christine Caulfield (2):
      cman/config: Change handle code to work with new corosync API
      ccs_tool: Add an option to disable full XPath parsing

David Teigland (14):
      libfence/fenced/fence_node: add logging to fence_node()
      libfence/fence_node: unfencing
      fence_node: unfence local node by default
      libfence/fence_node: use different exit code when unfencing undefined
      libfence: fix unfence_node return value
      fenced: check for group/groupd_compat at startup
      init.d/cman: unfence local node
      fenced/dlm_controld/gfs_controld: clean up cman admin handle
      fenced: cpg_finalize
      groupd: cpg_finalize
      dlm_controld: cpg_finalize
      gfs_controld: cpg_finalize
      dlm_controld: finalize ckpt handle
      gfs_controld: finalize ckpt handle

Fabio M.
Di Nitto (13): cmannotifyd: add options to init script to handle daemon startup fence: fix fence agents man page installation libfence: fix premature free of victim name qdisk: fix pid check in shutdown operation libccs: cope with new hdb from corosync copyright: allow non-redhat-copyright in fence agents ifmib: fix copyright attribution fence_node: add missing cman_node_t struct initialization init: make groupd startup conditional init: stop daemons before cman init: fix return code check in cman libccs: unbreak full xpath one more time libccs: fix fallout from uint to hbd conversion Jan Friesse (3): fence_vmware: Changed long option for datacenter fence_*.pl: Fix Perl scripts to force stdout close fence_*.py: Fix Python (fencing.py based) scripts to force stdout close Juanjo Villaplana (2): resource-agents: fix netfsclient cache handling qdiskd: Fix init script 'restart' operation Lon Hohberger (7): rgmanager: Enable exclusive prioritization use case qdiskd: Warn if we find >1 label with the same name qdiskd: Fix logging call for warning cman: Make fence_xvm's metadata not crash XML parsers cman: Just print; no need for a buffer in fence_xvm fence: Revert change to 'required' for 'action' operation config: RelaxNG Schema for stable3 branch Mark Hlawatschek (1): resource-agents: Tweak environment for SAP resource agents Yevheniy Demchenko (1): rgmanager: Fix poor -F handling on enable cman/daemon/ais.c | 8 +- cman/daemon/cman-preconfig.c | 62 +- cman/daemon/cmanconfig.c | 18 +- cman/daemon/commands.c | 8 +- cman/daemon/daemon.c | 28 +- cman/daemon/nodelist.h | 22 +- cman/init.d/Makefile | 1 + cman/init.d/cman.in | 67 +- cman/init.d/qdiskd.in | 4 +- cman/qdisk/main.c | 6 +- cman/qdisk/proc.c | 4 +- config/libs/libccsconfdb/ccs_internal.h | 14 +- config/libs/libccsconfdb/fullxpath.c | 45 +- config/libs/libccsconfdb/libccs.c | 74 +- config/libs/libccsconfdb/xpathlite.c | 16 +- config/plugins/ldap/configldap.c | 14 +- config/plugins/xml/cluster.rng | 2376 +++++++++++++++++++++++ config/plugins/xml/config.c | 6 +- config/tools/ccs_tool/ccs_tool.c | 13 +- fence/agents/alom/fence_alom.py | 2 + fence/agents/apc/fence_apc.py | 2 + fence/agents/baytech/fence_baytech.pl | 9 + fence/agents/bladecenter/fence_bladecenter.py | 2 + fence/agents/brocade/fence_brocade.pl | 9 + fence/agents/bullpap/fence_bullpap.pl | 9 + fence/agents/cpint/fence_cpint.pl | 9 + fence/agents/drac/fence_drac.pl | 9 + fence/agents/drac/fence_drac5.py | 2 + fence/agents/egenera/fence_egenera.pl | 9 + fence/agents/eps/fence_eps.py | 2 + fence/agents/ibmblade/fence_ibmblade.pl | 9 + fence/agents/ifmib/fence_ifmib.py | 8 +- fence/agents/ilo/fence_ilo.py | 2 + fence/agents/ldom/fence_ldom.py | 3 +- fence/agents/lib/fencing.py.py | 13 +- fence/agents/lpar/fence_lpar.py | 2 + fence/agents/mcdata/fence_mcdata.pl | 9 + fence/agents/sanbox2/fence_sanbox2.pl | 9 + fence/agents/scsi/fence_scsi.pl | 9 + fence/agents/virsh/fence_virsh.py | 2 + fence/agents/vixel/fence_vixel.pl | 9 + fence/agents/vmware/fence_vmware.py | 2 + fence/agents/wti/fence_wti.py | 2 + fence/agents/xcat/fence_xcat.pl | 9 + fence/agents/xvm/options.c | 25 +- fence/agents/zvm/fence_zvm.pl | 9 + fence/fence_node/Makefile | 4 +- fence/fence_node/fence_node.c | 177 ++- fence/fenced/config.c | 63 +- fence/fenced/cpg.c | 14 +- fence/fenced/member_cman.c | 5 + fence/fenced/recover.c | 60 +- fence/libfence/Makefile | 3 + fence/libfence/agent.c | 405 ++++- fence/libfence/libfence.h | 27 +- fence/man/Makefile | 12 +- group/daemon/cpg.c | 28 + group/daemon/gd_internal.h | 2 + 
group/daemon/main.c | 6 +- group/dlm_controld/cpg.c | 11 +- group/dlm_controld/dlm_daemon.h | 1 + group/dlm_controld/main.c | 4 +- group/dlm_controld/member_cman.c | 5 + group/dlm_controld/plock.c | 9 + group/gfs_controld/cpg-new.c | 11 +- group/gfs_controld/cpg-old.c | 6 +- group/gfs_controld/gfs_daemon.h | 1 + group/gfs_controld/main.c | 4 +- group/gfs_controld/member_cman.c | 6 + group/gfs_controld/plock.c | 8 + group/man/groupd.8 | 6 +- rgmanager/src/daemons/rg_state.c | 9 +- rgmanager/src/resources/SAPDatabase | 19 +- rgmanager/src/resources/SAPInstance | 14 +- rgmanager/src/resources/default_event_script.sl | 154 ++- rgmanager/src/resources/nfsclient.sh | 6 +- scripts/fenceparse | 5 +-
77 files changed, 3714 insertions(+), 343 deletions(-)
-------------- next part --------------
A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL:

From bpejakov at fina.hr Tue Mar 3 10:22:32 2009
From: bpejakov at fina.hr (Branimir)
Date: Tue, 03 Mar 2009 11:22:32 +0100
Subject: [Linux-cluster] RHEL cluster 5 fencing problem
Message-ID: <49AD04E8.2020609@fina.hr>

Hi list,

I am testing a RHEL 5 cluster for production (a two-node cluster). So far, so good :) I simulated a network outage and fencing works just fine here. But there is a problem. When I simulate a power outage - literally pulling the plug from one of the nodes - the node that is still alive tries to fence the node that is powered down, and it retries again and again, but it cannot succeed because the powered-down node is, well, powered down :) The surviving node cannot acquire the cluster resources until it has fenced the dead node, and those fencing retries go on forever.

I was told to use a tiebreaker IP address to solve this. But as I understand it (from the thread "[Linux-cluster] two node cluster with IP tiebreaker failed"), I have to use a quorum disk to do that - or, more precisely, heuristics with ping, for example. Do I really have to use a quorum disk for this case? Is there some other solution? I would appreciate any hint!

Thanks in advance,
Branimir
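For reference, the "heuristics with ping" approach mentioned above is normally expressed as a <quorumd> block inside cluster.conf. A minimal sketch is shown below; the device label, the gateway address being pinged, and the interval/tko timing values are only placeholders and would need tuning for a real cluster, and power fencing still has to be configured separately:

  <quorumd interval="1" tko="10" votes="1" label="myqdisk">
      <heuristic program="ping -c1 -w1 192.168.1.254" score="1" interval="2" tko="3"/>
  </quorumd>
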
From Alain.Moulle at bull.net Tue Mar 3 13:57:06 2009
From: Alain.Moulle at bull.net (Alain.Moulle)
Date: Tue, 03 Mar 2009 14:57:06 +0100
Subject: [Linux-cluster] CS5 & qdisk : details about problem "Node is undead" (contd.)
Message-ID: <49AD3732.90805@bull.net>

Hi,

Here is some more information about my problem (see my previous email for the complete log). (Sorry if someone already sent me a response; I have not received one yet, perhaps because I'm in digest mode.)

If I start qdiskd after cman, and in the foreground, like this:

qdiskd -ddd -f

then the endless "Node is undead" loop no longer occurs; the log on stdout is:

[11136] info: Initial score 1/1
[11136] info: Initialization complete
[11136] notice: Score sufficient for master operation (1/1; required=1); upgrading
[11136] debug: Making bid for master
[11136] info: Assuming master role
[11136] debug: Node 1 is UP
[11136] debug: Node 1 missed an update (2/10)
[11136] debug: Node 1 missed an update (3/10)
[11136] debug: Node 1 missed an update (4/10)
[11136] debug: Node 1 missed an update (5/10)
[11136] debug: Node 1 missed an update (6/10)
[11136] debug: Node 1 missed an update (7/10)
[11136] debug: Node 1 missed an update (8/10)
[11136] debug: Node 1 missed an update (9/10)
[11136] debug: Node 1 missed an update (10/10)
[11136] debug: Node 1 missed an update (11/10)
[11136] debug: Node 1 DOWN
[11136] notice: Writing eviction notice for node 1
[11136] debug: Telling CMAN to kill the node
[11136] notice: Node 1 evicted

The log ends here, there is no "Node x is undead" problem, and when node1 is rebooted, the launch of CS on node1 succeeds (node1 rejoins the cluster without problem, since node2 has not flagged it as "undead").

I don't know whether this test is meaningful enough to help identify the problem. Lon, perhaps you have an idea about this? I am sure node1 is not writing its timestamp again during the 11 "missed an update" messages, so the cause of "Node is undead" is not that node2 read a node1 timestamp from the quorum disk after having declared it Off.

Thanks for your help
Regards
Alain
-------------- next part --------------
An HTML attachment was scrubbed... URL:

From hlawatschek at atix.de Tue Mar 3 14:37:53 2009
From: hlawatschek at atix.de (Mark Hlawatschek)
Date: Tue, 3 Mar 2009 15:37:53 +0100
Subject: [Linux-cluster] CMAN: sending membership request, unable to join cluster.
In-Reply-To: <499297AA.8030200@redhat.com>
References: <1234342644.11142.142.camel@thijn.enovation.oper> <499297AA.8030200@redhat.com>
Message-ID: <200903031537.53544.hlawatschek@atix.de>

I'm seeing the same problem in a 4.7 cluster.
Chrissie, is there a solution or another bz for the problem?

-Mark

On Wednesday 11 February 2009 10:17:30 Chrissie Caulfield wrote: > thijn wrote: > > Hi, > > > > I have the following problem. > > CMAN: removing node [server1] from the cluster : Missed too many > > heartbeats > > When the server comes back up: > > Feb 10 14:43:58 server1 kernel: CMAN: sending membership request > > after which it will try to join until the end of times. > > > > In the current problem, server2 is active and server1 has the problem > > not being able to join the cluster. > > > > The setup is a two server setup cluster. > > We have had the problem on several clusters. > > We "fixed" it usualy with rebooting the other node at which the cluster > > would repair itself and all ran smoothly from thereon. > > Naturally this will disrupt any services running on the cluster. And its > > not really a solution that will win prices. > > The problem is that server1(the problem one) is in a inquorate state and > > we are unable to get it to a quorate state, neither do we see why this > > is the case.
> > We tried to use a test setup to replay the problem, we were unable. > > > > So we decided to try to find a way to fix the state of the cluster using > > the tools the system provides. > > > > The problem we see presents itself after a fence action by either node. > > When we would bring down both nodes to stabilize the issue, the cluster > > would become healthy and after that we can reboot either node and it > > will rejoin the cluster. > > It seems the problem presents itself when "pulling the plug" out of the > > server. > > We run on IBM Xservers using the SA-adapter as a fence device. > > The fence device is in a different subnet then the subnet on which the > > cluster communicates. > > Bot fence devices are on the same subnet/vlan. > > > > CentOS release 4.6 (Final) > > Linux server2 2.6.9-67.ELsmp #1 SMP Fri Nov 16 12:48:03 EST 2007 i686 > > i686 i386 GNU/Linux > > cman_tool 1.0.17 (built Mar 20 2007 17:10:52) > > Copyright (C) Red Hat, Inc. 2004 All rights reserved. > > > > All versions of libraries and packages, kernel modules and all that is > > dependent for the GFS cluster to operate are identical on both nodes. > > > > Cluster.conf > > [root at server1 log]# cat /etc/cluster/cluster.conf > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > name="saserver1" passwd="XXXXXXX"/> > > > name="saserver2" passwd="XXXXXXX"/> > > > > > > > > > > > > > > > > [root at server1 log]# cat /etc/hosts > > 127.0.0.1 localhost.localdomain localhost > > > > Both server are able to ping each other and also the broadcast address, > > so there is no firewall filtering UDP packets > > When i tcpdump the line i see traffic going both ways, > > > > Both servers are in the same vlan > > 14:51:28.703240 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto > > 17, length: 56) server2.production.loc.6809 > > > broadcast.production.loc.6809: UDP, length 28 > > 14:51:28.703277 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto > > 17, length: 140) server1.production.loc.6809 > > > server2.production.loc.6809: UDP, length 112 > > 14:51:33.703240 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto > > 17, length: 56) server2.production.loc.6809 > > > broadcast.production.loc.6809: UDP, length 28 > > 14:51:33.703310 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto > > 17, length: 140) server1.production.loc.6809 > > > server2.production.loc.6809.6809: UDP, length 112 > > > > Is this normal network behavior when a cluster is inquorate? > > I see that server1 is talking to server2, but server2 is only talking in > > broadcasts. 
> > > > When i start of try to join the cluster > > Feb 10 09:36:06 server1 cman: cman_tool: Node is already active failed > > > > [root at server1 ~]# cman_tool status > > Protocol version: 5.0.1 > > Config version: 3 > > Cluster name: NAME_cluster > > Cluster ID: 64692 > > Cluster Member: No > > Membership state: Joining > > > > [root at server2 log]# cman_tool status > > Protocol version: 5.0.1 > > Config version: 3 > > Cluster name: RWSEems_cluster > > Cluster ID: 64692 > > Cluster Member: Yes > > Membership state: Cluster-Member > > Nodes: 1 > > Expected_votes: 1 > > Total_votes: 1 > > Quorum: 1 > > Active subsystems: 7 > > Node name: server2.production.loc > > Node ID: 2 > > Node addresses: server1.production.loc > > > > [root at server1 ~]# cman_tool nodes > > Node Votes Exp Sts Name > > > > [root at server2 log]# cman_tool nodes > > Node Votes Exp Sts Name > > 1 1 1 X server1.production.loc > > 2 1 1 M server2.production.loc > > > > When i start cman > > service cman start > > > > Feb 10 14:06:30 server1 kernel: CMAN: Waiting to join or form a > > Linux-cluster > > Feb 10 14:06:30 server1 ccsd[21964]: Connected to cluster infrastruture > > via: CMAN/SM Plugin v1.1.7.4 > > Feb 10 14:06:30 server1 ccsd[21964]: Initial status:: Inquorate > > > > > > It seems to me that this should be fixable with the tools as provided > > with the RedHat Cluster Suite, without disturbing the running cluster. > > It seems quite insane if i need to restart my cluster to have it all > > working again.. kinda spoils the idea of running a cluster. > > This setup is running in a HA envirmoment and we can have nearly to no > > downtime. > > > > The logs on the healthy server (server2) does not mention/complain > > anything of errors when rebooting, restarting cman or when server1 want > > to join the cluster. > > We see no disallowed, refused or anything that server2 is not willing to > > play with server1 > > > > I have been looking at this thing for a while now.. am i missing > > anything? > > This is a known bug, see > > https://bugzilla.redhat.com/show_bug.cgi?id=475293 > > It's fixed in 4.7 or you can run a program to set up a workaround. > > Having said that I have heard reports of is still happening in some > circumstances ... but I don't have any more detail > > -- > > Chrissie > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Dipl.-Ing. Mark Hlawatschek From ccaulfie at redhat.com Tue Mar 3 14:54:26 2009 From: ccaulfie at redhat.com (Chrissie Caulfield) Date: Tue, 03 Mar 2009 14:54:26 +0000 Subject: [Linux-cluster] CMAN: sending membership request, unable to join cluster. In-Reply-To: <200903031537.53544.hlawatschek@atix.de> References: <1234342644.11142.142.camel@thijn.enovation.oper> <499297AA.8030200@redhat.com> <200903031537.53544.hlawatschek@atix.de> Message-ID: <49AD44A2.10309@redhat.com> Mark Hlawatschek wrote: > I'm seeing the same problem in a 4.7 cluster. > Chrissi, is there a solution or another bz for the problem ? > As I said (rather unhelpfully) in my email I have no more information than that. No-one has provided any more detail and I haven't managed to reproduce it. Not that I've had much chance with the work on RHEL5 and later code I've been doing. Actually, providing useful information for this is almost impossibly difficult anyway unless you're some sort of magician with kernel cores. 
We have one at Red Hat but, as you might imagine, he's in very high demand ;-) About the only thing that a normal mortal might be able to do that could be useful is a tcpdump of the node talking successfully to the cluster following by it going down and failing to join. If you're feeling a little more adventurous then compiling the code with DEBUG_COMMS enabled would help enormously too. It's possible that the workaround program I posted in the BZ might mitigate the problem a little, but without knowing much more about what is happening I can't honestly be sure. Chrissie > -Mark > > On Wednesday 11 February 2009 10:17:30 Chrissie Caulfield wrote: >> thijn wrote: >>> Hi, >>> >>> I have the following problem. >>> CMAN: removing node [server1] from the cluster : Missed too many >>> heartbeats >>> When the server comes back up: >>> Feb 10 14:43:58 server1 kernel: CMAN: sending membership request >>> after which it will try to join until the end of times. >>> >>> In the current problem, server2 is active and server1 has the problem >>> not being able to join the cluster. >>> >>> The setup is a two server setup cluster. >>> We have had the problem on several clusters. >>> We "fixed" it usualy with rebooting the other node at which the cluster >>> would repair itself and all ran smoothly from thereon. >>> Naturally this will disrupt any services running on the cluster. And its >>> not really a solution that will win prices. >>> The problem is that server1(the problem one) is in a inquorate state and >>> we are unable to get it to a quorate state, neither do we see why this >>> is the case. >>> We tried to use a test setup to replay the problem, we were unable. >>> >>> So we decided to try to find a way to fix the state of the cluster using >>> the tools the system provides. >>> >>> The problem we see presents itself after a fence action by either node. >>> When we would bring down both nodes to stabilize the issue, the cluster >>> would become healthy and after that we can reboot either node and it >>> will rejoin the cluster. >>> It seems the problem presents itself when "pulling the plug" out of the >>> server. >>> We run on IBM Xservers using the SA-adapter as a fence device. >>> The fence device is in a different subnet then the subnet on which the >>> cluster communicates. >>> Bot fence devices are on the same subnet/vlan. >>> >>> CentOS release 4.6 (Final) >>> Linux server2 2.6.9-67.ELsmp #1 SMP Fri Nov 16 12:48:03 EST 2007 i686 >>> i686 i386 GNU/Linux >>> cman_tool 1.0.17 (built Mar 20 2007 17:10:52) >>> Copyright (C) Red Hat, Inc. 2004 All rights reserved. >>> >>> All versions of libraries and packages, kernel modules and all that is >>> dependent for the GFS cluster to operate are identical on both nodes. 
>>> >>> Cluster.conf >>> [root at server1 log]# cat /etc/cluster/cluster.conf >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >> name="saserver1" passwd="XXXXXXX"/> >>> >> name="saserver2" passwd="XXXXXXX"/> >>> >>> >>> >>> >>> >>> >>> >>> [root at server1 log]# cat /etc/hosts >>> 127.0.0.1 localhost.localdomain localhost >>> >>> Both server are able to ping each other and also the broadcast address, >>> so there is no firewall filtering UDP packets >>> When i tcpdump the line i see traffic going both ways, >>> >>> Both servers are in the same vlan >>> 14:51:28.703240 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto >>> 17, length: 56) server2.production.loc.6809 > >>> broadcast.production.loc.6809: UDP, length 28 >>> 14:51:28.703277 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto >>> 17, length: 140) server1.production.loc.6809 > >>> server2.production.loc.6809: UDP, length 112 >>> 14:51:33.703240 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto >>> 17, length: 56) server2.production.loc.6809 > >>> broadcast.production.loc.6809: UDP, length 28 >>> 14:51:33.703310 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto >>> 17, length: 140) server1.production.loc.6809 > >>> server2.production.loc.6809.6809: UDP, length 112 >>> >>> Is this normal network behavior when a cluster is inquorate? >>> I see that server1 is talking to server2, but server2 is only talking in >>> broadcasts. >>> >>> When i start of try to join the cluster >>> Feb 10 09:36:06 server1 cman: cman_tool: Node is already active failed >>> >>> [root at server1 ~]# cman_tool status >>> Protocol version: 5.0.1 >>> Config version: 3 >>> Cluster name: NAME_cluster >>> Cluster ID: 64692 >>> Cluster Member: No >>> Membership state: Joining >>> >>> [root at server2 log]# cman_tool status >>> Protocol version: 5.0.1 >>> Config version: 3 >>> Cluster name: RWSEems_cluster >>> Cluster ID: 64692 >>> Cluster Member: Yes >>> Membership state: Cluster-Member >>> Nodes: 1 >>> Expected_votes: 1 >>> Total_votes: 1 >>> Quorum: 1 >>> Active subsystems: 7 >>> Node name: server2.production.loc >>> Node ID: 2 >>> Node addresses: server1.production.loc >>> >>> [root at server1 ~]# cman_tool nodes >>> Node Votes Exp Sts Name >>> >>> [root at server2 log]# cman_tool nodes >>> Node Votes Exp Sts Name >>> 1 1 1 X server1.production.loc >>> 2 1 1 M server2.production.loc >>> >>> When i start cman >>> service cman start >>> >>> Feb 10 14:06:30 server1 kernel: CMAN: Waiting to join or form a >>> Linux-cluster >>> Feb 10 14:06:30 server1 ccsd[21964]: Connected to cluster infrastruture >>> via: CMAN/SM Plugin v1.1.7.4 >>> Feb 10 14:06:30 server1 ccsd[21964]: Initial status:: Inquorate >>> >>> >>> It seems to me that this should be fixable with the tools as provided >>> with the RedHat Cluster Suite, without disturbing the running cluster. >>> It seems quite insane if i need to restart my cluster to have it all >>> working again.. kinda spoils the idea of running a cluster. >>> This setup is running in a HA envirmoment and we can have nearly to no >>> downtime. >>> >>> The logs on the healthy server (server2) does not mention/complain >>> anything of errors when rebooting, restarting cman or when server1 want >>> to join the cluster. >>> We see no disallowed, refused or anything that server2 is not willing to >>> play with server1 >>> >>> I have been looking at this thing for a while now.. am i missing >>> anything? 
>> This is a known bug, see >> >> https://bugzilla.redhat.com/show_bug.cgi?id=475293 >> >> It's fixed in 4.7 or you can run a program to set up a workaround. >> >> Having said that I have heard reports of is still happening in some >> circumstances ... but I don't have any more detail >> >> -- >> >> Chrissie >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- Chrissie

From vu at sivell.com Tue Mar 3 17:36:11 2009
From: vu at sivell.com (vu pham)
Date: Tue, 03 Mar 2009 11:36:11 -0600
Subject: [Linux-cluster] building clusters via command line
Message-ID: <49AD6A8B.5070808@sivell.com>

Hi all,

I am new to RH cluster. I can use the ricci/luci tools to set up and configure clusters, but I prefer doing it from the command line. Are there any documents, as a starting guide, on how to use ccs_tool, fence_tool ... to set up a new cluster?

Thanks for your advice,

Vu

--

From rajpurush at gmail.com Tue Mar 3 17:39:41 2009
From: rajpurush at gmail.com (Rajeev P)
Date: Tue, 3 Mar 2009 23:09:41 +0530
Subject: [Linux-cluster] two node cluster partition
In-Reply-To: <58aa8d780902251124t4a3d7855y2454f51ef759485@mail.gmail.com>
References: <7a271b290902222219u709e0398t555d7770e3fb32fb@mail.gmail.com> <58aa8d780902251124t4a3d7855y2454f51ef759485@mail.gmail.com>
Message-ID: <7a271b290903030939g12a13965idd8dfad4f9ba1a8@mail.gmail.com>

Thanks for the link. It was quite explanatory!!

On 2/26/09, Flavio Junior wrote: > > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Here (whole page): > > http://sources.redhat.com/cluster/wiki/FAQ/CMAN#two_node_dual > > And the entire FAQ is really explanatory. > > - -- > > Flávio do Carmo Júnior aka waKKu > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.9 (MingW32) > Comment: http://getfiregpg.org > > iEYEARECAAYFAkmlmuoACgkQgyuXjr6dyksY5wCguMNp+0/1tsSN4mGzlLl7gvak > ia4AoMiZpCRguBA5Le1j67vz4hyHvfUU > =aqIx > -----END PGP SIGNATURE----- > > 2009/2/23 Rajeev P : > > Hi, > > > > I have question regarding a network partition in a 2 node cluster. > > > > Consider a 2-node cluster (node1 and node2) setup with a cross-cable for > > heartbeat and setup to use HP iLO as the fencing mechasim. In the event > of > > network partition (and this case assume that the cross cable was pulled > out) > > one of the node's succeed in fencing the other. The question is, when the > > fenced node on rebooting would it attempt fence the existing node if > > it can't communicate with it (since the heartbeat network is still down). > > > > For example consider that in the event of n/w partition, node1 fences > node2. > > Would node2 on rebooting (can't communicate with node1) would attempt to > > fence node1 and form a 1-node cluster. > > > > If yes, can this infinite fencing loop be prevented in 2-node with the > use > > of a qdisk which denies quorum to node rebooting? > > > > Regards, > > Rajeev > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster >
-------------- next part --------------
An HTML attachment was scrubbed... URL:
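The two_node mode covered by that FAQ entry, which is also relevant to the earlier question about building a cluster entirely from the command line, comes down to a small hand-written cluster.conf. A rough sketch follows, with hypothetical cluster and node names and the fencing sections left empty for brevity (a real two-node cluster needs working fencing):

  <?xml version="1.0"?>
  <cluster name="testcluster" config_version="1">
          <cman two_node="1" expected_votes="1"/>
          <clusternodes>
                  <clusternode name="node1.example.com" nodeid="1" votes="1">
                          <fence/>
                  </clusternode>
                  <clusternode name="node2.example.com" nodeid="2" votes="1">
                          <fence/>
                  </clusternode>
          </clusternodes>
          <fencedevices/>
          <rm/>
  </cluster>

With a file along these lines in /etc/cluster/cluster.conf on both nodes, starting the cman service should bring the cluster up without luci.
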
From breaktime123 at yahoo.com Tue Mar 3 20:06:28 2009
From: breaktime123 at yahoo.com (break time)
Date: Tue, 3 Mar 2009 12:06:28 -0800 (PST)
Subject: [Linux-cluster] building clusters via command line
In-Reply-To: <49AD6A8B.5070808@sivell.com>
Message-ID: <387111.48006.qm@web111111.mail.gq1.yahoo.com>

You can edit the /etc/cluster/cluster.conf file without using the command line.

Minh

--- On Tue, 3/3/09, vu pham wrote: From: vu pham Subject: [Linux-cluster] building clusters via command line To: "linux clustering" Date: Tuesday, March 3, 2009, 9:36 AM Hi all, I am new to RH cluster. I can use to ricci/luci tool to setup and configure clusters, but I prefer doing it via command lines. Do we have some documents as starting guide on how to use those ccs_tool, fence_tool ... to set up a new cluster ? Thanks for your advice, Vu -- -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed... URL:

From vu at sivell.com Tue Mar 3 20:09:14 2009
From: vu at sivell.com (vu pham)
Date: Tue, 03 Mar 2009 14:09:14 -0600
Subject: [Linux-cluster] building clusters via command line
In-Reply-To: <387111.48006.qm@web111111.mail.gq1.yahoo.com>
References: <387111.48006.qm@web111111.mail.gq1.yahoo.com>
Message-ID: <49AD8E6A.5020002@sivell.com>

Thanks, Minh. That's what I am doing now. I found this page http://sources.redhat.com/cluster/wiki/FAQ/GeneralQuestions and am working on /etc/cluster/cluster.conf now. :)

Thanks,

Vu

break time wrote: > > You can edit /etc/cluster/cluster.conf file w/o using command line. > > Minh > > > --- On Tue, 3/3/09, vu pham wrote: > > From: vu pham > Subject: [Linux-cluster] building clusters via command line > To: "linux clustering" > Date: Tuesday, March 3, 2009, 9:36 AM > > Hi all, > > I am new to RH cluster. I can use to ricci/luci tool to setup and configure > clusters, but I prefer doing it via command lines. > Do we have some documents as starting guide on how to use those ccs_tool, > fence_tool ... to set up a new cluster ? > > Thanks for your advice, > > Vu > > -- > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster

From Timothy.Ward at itt.com Tue Mar 3 23:15:15 2009
From: Timothy.Ward at itt.com (Ward, Timothy - SSD)
Date: Tue, 3 Mar 2009 18:15:15 -0500
Subject: [Linux-cluster] fence_apc does not like "outlet" user
Message-ID:

I have an APC 7901 and have been using the fence_apc tool to test the functionality. In the past I've always used the "root" account to control the NPS, but that's bad practice, so I am trying to get an "outlet" user set up. Here are my results:

[root at node1 ~]# fence_apc -a 10.1.1.12 -l root -n 3 -p root -o on -v
Status check successful. Port 3 is ON
Power On successful

[root at node1 ~]# fence_apc -a 10.1.1.12 -l device -n 3 -p device -o on -v
Status check successful. Port 3 is ON
Power On successful

[root at node1 ~]# fence_apc -a 10.1.1.12 -l cluster -n 3 -p cluster -o on -v
Traceback (most recent call last):
  File "/sbin/fence_apc", line 829, in ?
main() File "/sbin/fence_apc", line 327, in main backout(sock) # Return to control screen File "/sbin/fence_apc", line 393, in backout i, mo, txt = sock.expect(regex_list, TELNET_TIMEOUT) File "/usr/lib64/python2.4/telnetlib.py", line 620, in expect text = self.read_very_lazy() File "/usr/lib64/python2.4/telnetlib.py", line 400, in read_very_lazy raise EOFError, 'telnet connection closed' EOFError: telnet connection closed I tried this with CentOS 5.1 and 5.2. For users root=admin, device=device, cluster=outlet. The info in /tmp/apclog does not have anything that is obviously useful (I compared the device and cluster user outputs). I did notice the log does not fill until the fence_apc fails for the outlet "cluster" user. I also tried a different outlet user name, checked that the outlet user has access to all of the outlets. My next idea was to grit my teeth and upgrade the APC firmware. Any suggestions would be welcome. Thanks guys, Tim This e-mail and any files transmitted with it may be proprietary and are intended solely for the use of the individual or entity to whom they are addressed. If you have received this e-mail in error please notify the sender. Please note that any views or opinions presented in this e-mail are solely those of the author and do not necessarily represent those of ITT Corporation. The recipient should check this e-mail and any attachments for the presence of viruses. ITT accepts no liability for any damage caused by any virus transmitted by this e-mail. From jfriesse at redhat.com Wed Mar 4 09:48:23 2009 From: jfriesse at redhat.com (Jan Friesse) Date: Wed, 04 Mar 2009 10:48:23 +0100 Subject: [Linux-cluster] fence_apc does not like "outlet" user In-Reply-To: References: Message-ID: <49AE4E67.8020007@redhat.com> Timothy, in 5.3 is new agent, which is able to work with "outleter" user. Regards, Honza Ward, Timothy - SSD wrote: > I have an APC 7901 and have been using the fence_apc tool to test the functionality. In the past I've always used the "root" account to control the NPS but thats bad so I am trying to get an "outlet" user setup. Here are my results: > > [root at node1 ~]# fence_apc -a 10.1.1.12 -l root -n 3 -p root -o on -v > Status check successful. Port 3 is ON > Power On successful > > [root at node1 ~]# fence_apc -a 10.1.1.12 -l device -n 3 -p device -o on -v > Status check successful. Port 3 is ON > Power On successful > > [root at node1 ~]# fence_apc -a 10.1.1.12 -l cluster -n 3 -p cluster -o on -v > Traceback (most recent call last): > File "/sbin/fence_apc", line 829, in ? > main() > File "/sbin/fence_apc", line 327, in main > backout(sock) # Return to control screen > File "/sbin/fence_apc", line 393, in backout > i, mo, txt = sock.expect(regex_list, TELNET_TIMEOUT) > File "/usr/lib64/python2.4/telnetlib.py", line 620, in expect > text = self.read_very_lazy() > File "/usr/lib64/python2.4/telnetlib.py", line 400, in read_very_lazy > raise EOFError, 'telnet connection closed' > EOFError: telnet connection closed > > I tried this with CentOS 5.1 and 5.2. For users root=admin, device=device, cluster=outlet. > > The info in /tmp/apclog does not have anything that is obviously useful (I compared the device and cluster user outputs). I did notice the log does not fill until the fence_apc fails for the outlet "cluster" user. I also tried a different outlet user name, checked that the outlet user has access to all of the outlets. > > My next idea was to grit my teeth and upgrade the APC firmware. > > Any suggestions would be welcome. 
> > Thanks guys, > Tim > > This e-mail and any files transmitted with it may be proprietary and are intended solely for the use of the individual or entity to whom they are addressed. If you have received this e-mail in error please notify the sender. > Please note that any views or opinions presented in this e-mail are solely those of the author and do not necessarily represent those of ITT Corporation. The recipient should check this e-mail and any attachments for the presence of viruses. ITT accepts no liability for any damage caused by any virus transmitted by this e-mail. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster

From Alain.Moulle at bull.net Wed Mar 4 12:55:53 2009
From: Alain.Moulle at bull.net (Alain.Moulle)
Date: Wed, 04 Mar 2009 13:55:53 +0100
Subject: [Linux-cluster] Which bugzilla number ?
Message-ID: <49AE7A59.8040108@bull.net>

Hi,

Would it be possible for you to give me the bugzilla number?

Thanks a lot.
Alain

> Message: 1 > Date: Wed, 24 Sep 2008 17:00:24 -0400 > From: Lon Hohberger > Subject: Re: [Linux-cluster] CS5 / quorum disk configuration > > On Wed, 2008-09-24 at 14:40 +0200, Alain.Moulle wrote: > > Hi > > > > Something strange about qdisk configuration with CS5 : > > it seems that it is necessary to reboot the system after the > configuration > > of quorum disk (via mkqdisk and in cluster.conf) so that it works fine, > > otherwise, we encounter the problem of "Node x is undead" at first > > failover try. > > Whereas after reboot of both nodes, this problem definitely dissapears. > > There's a bug which should have been fixed in 5.2, but was reverted, > then later fixed again. > > A checkout from the RHEL5 git branch should work fine. > > -- Lon
-------------- next part --------------
An HTML attachment was scrubbed... URL:

From dougbunger at yahoo.com Wed Mar 4 15:34:16 2009
From: dougbunger at yahoo.com (Doug Bunger)
Date: Wed, 4 Mar 2009 07:34:16 -0800 (PST)
Subject: [Linux-cluster] Update cluster.conf in Fedora 10
Message-ID: <19264.57394.qm@web110202.mail.gq1.yahoo.com>

I'm having trouble making the cluster aware of changes in Fedora 10 (x86_64). The setup has three VMs accessing a shared, attached partition, formatted as GFS. When modifying cluster.conf and incrementing the version number, I have to reboot the nodes. I've found some online resources that say to use cman_tool and others that say to use ccs_tool. What is the correct command for Fedora 10?

# cman_tool nodes
Node Sts Inc Joined Name
2 M 172 2009-03-04 08:39:52 gfs2
3 M 172 2009-03-04 08:39:52 gfs3
4 M 160 2009-03-04 08:39:52 gfs4
# head -2 /etc/cluster/cluster.conf
# cman_tool version
6.1.0 config 9
# vi /etc/cluster/cluster.conf
# head -2 /etc/cluster/cluster.conf
# cman_tool version -r 10
# cman_tool version
6.1.0 config 9
# ccs_tool update /etc/cluster/cluster.conf
Unknown command, update
Try 'ccs_tool help' for help

Thanks.
-------------- next part --------------
An HTML attachment was scrubbed... URL:

From ccaulfie at redhat.com Wed Mar 4 15:41:46 2009
From: ccaulfie at redhat.com (Chrissie Caulfield)
Date: Wed, 04 Mar 2009 15:41:46 +0000
Subject: [Linux-cluster] Update cluster.conf in Fedora 10
In-Reply-To: <19264.57394.qm@web110202.mail.gq1.yahoo.com>
References: <19264.57394.qm@web110202.mail.gq1.yahoo.com>
Message-ID: <49AEA13A.9050103@redhat.com>

Doug Bunger wrote: > I'm having trouble making the cluster aware of changes in Fedora 10 > (x86_64).
The setup has three VMs accessing a shared, attached > partition, formatted as GFS. When modifying the cluster.conf and > incrementing version number, I have to boot the nodes. I've found some > online resources that say use cman_tool and others that say use > ccs_tool. What is the correct commanf for Fedora 10? > > # cman_tool nodes > Node Sts Inc Joined Name > 2 M 172 2009-03-04 08:39:52 gfs2 > 3 M 172 2009-03-04 08:39:52 gfs3 > 4 M 160 2009-03-04 08:39:52 gfs4 > # head -2 /etc/cluster/cluster.conf > > > # cman_tool version > 6.1.0 config 9 > # vi /etc/cluster/cluster.conf > # head -2 /etc/cluster/cluster.conf > > > # cman_tool version -r 10 > # cman_tool version > 6.1.0 config 9 > # ccs_tool update /etc/cluster/cluster.conf > Unknown command, update > Try 'ccs_tool help' for help > On Fedora10 you only need to issue a cman_tool -r command to update the configuration file version. ccs_tool was for updating ccsd, which is no longer used. That doesn't explain why your version hasn't updated, of course. It might be worth checking if there are any updated packages for fedora as it seems to work correctly on my test systems. Chrissie From spods at iinet.net.au Wed Mar 4 16:12:52 2009 From: spods at iinet.net.au (Stewart Walters) Date: Thu, 05 Mar 2009 01:12:52 +0900 Subject: [Linux-cluster] Update cluster.conf in Fedora 10 Message-ID: <2292.1236183172@iinet.net.au> As far as I'm aware, your not supposed to edit /etc/cluster/cluster.conf directly. Doing so will cause cman to detect an unregistered change to /etc/cluster/cluster.conf and roll back to the previous version (someone correct me if I'm wrong here). The correct procedure is to take a copy of the file, then increment number of the cluster version, then use ccs_tool update to increment the cluster. That is: - # cp /etc/cluster/cluster.conf /tmp # vi /tmp/cluster.conf [increment it's number to a number higher than all nodes in the cluster] # ccs_tool update /tmp/cluster.conf ccs_tool will then distribute the new file to all cluster nodes, and cause this new version to be backed up as well (for further rollbacks of /etc/cluster/cluster.conf). Regards, Stewart On Wed Mar 4 7:34 , Doug Bunger sent: >I'm having trouble making the cluster aware of changes in Fedora 10 (x86_64).?? The setup has three VMs accessing a shared, attached partition, formatted as GFS. ???? When modifying the cluster.conf and incrementing version number, I have to boot the nodes.?? I've found some online resources that say use cman_tool and others that say use ccs_tool.?? What is the correct commanf for Fedora 10? > >??# cman_tool nodes >??Node Sts Inc Joined Name >??2 M 172 2009-03-04 08:39:52 gfs2 >??3 M 172 2009-03-04 08:39:52 gfs3 >??4 M 160 2009-03-04 08:39:52 gfs4 >??# head -2 /etc/cluster/cluster.conf >?? >?? >??# cman_tool version >??6.1.0 config 9 >??# vi /etc/cluster/cluster.conf >??# head -2 > /etc/cluster/cluster.conf >?? >?? >??# cman_tool version -r 10 >??# cman_tool version >??6.1.0 config 9 >??# ccs_tool update /etc/cluster/cluster.conf >??Unknown command, update >??Try 'ccs_tool help' for help > >Thanks. > > > > From spods at iinet.net.au Wed Mar 4 16:19:32 2009 From: spods at iinet.net.au (Stewart Walters) Date: Thu, 05 Mar 2009 01:19:32 +0900 Subject: [Linux-cluster] Update cluster.conf in Fedora 10 Message-ID: <2398.1236183572@iinet.net.au> This is on Red Hat Enterprise Linux 5.2, so YMMV for Fedora 10. Regards, Stewart On Thu Mar 5 1:12 , Stewart Walters sent: >As far as I'm aware, your not supposed to edit /etc/cluster/cluster.conf directly. 
> >Doing so will cause cman to detect an unregistered change to >/etc/cluster/cluster.conf and roll back to the previous version (someone correct >me if I'm wrong here). > >The correct procedure is to take a copy of the file, then increment number of the >cluster version, then use ccs_tool update to increment the cluster. > >That is: - > ># cp /etc/cluster/cluster.conf /tmp ># vi /tmp/cluster.conf [increment it's number to a number higher than all nodes >in the cluster] ># ccs_tool update /tmp/cluster.conf > >ccs_tool will then distribute the new file to all cluster nodes, and cause this >new version to be backed up as well (for further rollbacks of >/etc/cluster/cluster.conf). > >Regards, > >Stewart > > > > > >On Wed Mar 4 7:34 , Doug Bunger sent: > >The setup has three VMs accessing a shared, attached partition, formatted as GFS. >> >> /etc/cluster/cluster.conf >> >>Thanks. >> >> >> >> > > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster From dougbunger at yahoo.com Thu Mar 5 14:05:38 2009 From: dougbunger at yahoo.com (dougbunger at yahoo.com) Date: Thu, 5 Mar 2009 06:05:38 -0800 (PST) Subject: [Linux-cluster] Update cluster.conf in Fedora 10 Message-ID: <829210.558.qm@web110201.mail.gq1.yahoo.com> All nodes are up to date, booted this within the last 24 hours. The behavior of ccs_tool seems to have changed at some point between 5.3 (FC6/7-ish?) and F10: ? # cp /etc/cluster/cluster.conf /tmp ? # vi /tmp/cluster.conf ? # ccs_tool update /tmp/cluster.conf ? Unknown command, update ? Try 'ccs_tool help' for help The man page still describes the update function. To your point about not editing, I'm there seem to be other differences, as system-config-cluster will not present the Mgmt Tab nor enable the "Send To Cluster" button, which works on my production RHEL cluster.? Unfortunately, the F10 works well enough, as long as I'm willing to reboot the entire cluster anytime I need to make a change. --- On Wed, 3/4/09, Stewart Walters wrote: From: Stewart Walters Subject: Re: [Linux-cluster] Update cluster.conf in Fedora 10 To: Linux-cluster at redhat.com Date: Wednesday, March 4, 2009, 10:19 AM -----Inline Attachment Follows----- This is on Red Hat Enterprise Linux 5.2, so YMMV for Fedora 10. Regards, Stewart On Thu Mar? 5? 1:12 , Stewart Walters? sent: >As far as I'm aware, your not supposed to edit /etc/cluster/cluster.conf directly. > >Doing so will cause cman to detect an unregistered change to >/etc/cluster/cluster.conf and roll back to the previous version (someone correct >me if I'm wrong here). > >The correct procedure is to take a copy of the file, then increment number of the >cluster version, then use ccs_tool update to increment the cluster. > >That is: - > ># cp /etc/cluster/cluster.conf /tmp ># vi /tmp/cluster.conf? [increment it's number to a number higher than all nodes >in the cluster] ># ccs_tool update /tmp/cluster.conf > >ccs_tool will then distribute the new file to all cluster nodes, and cause this >new version to be backed up as well (for further rollbacks of >/etc/cluster/cluster.conf). > >Regards, > >Stewart > > > > > >On Wed Mar? 4? 7:34 , Doug Bunger? sent: > >The setup has three VMs accessing a shared, attached partition, formatted as GFS. >> >> /etc/cluster/cluster.conf >> >>Thanks. >> >> >> >>? ? ? 
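For anyone finding this thread later, the two procedures being compared boil down to roughly the following; the target version number 10 and the /tmp path are only placeholders:

   # RHEL 5 / ccsd-based clusters (the procedure Stewart describes):
   cp /etc/cluster/cluster.conf /tmp
   vi /tmp/cluster.conf               # bump config_version="..."
   ccs_tool update /tmp/cluster.conf

   # Fedora 10, where ccsd is gone (the procedure Chrissie describes):
   vi /etc/cluster/cluster.conf       # bump config_version="..."
   cman_tool version -r 10            # push the new version number to the cluster
   cman_tool version                  # verify the running version actually changed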
> > >-- >Linux-cluster mailing list >Linux-cluster at redhat.com >https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From ccaulfie at redhat.com Thu Mar 5 15:59:09 2009 From: ccaulfie at redhat.com (Chrissie Caulfield) Date: Thu, 05 Mar 2009 15:59:09 +0000 Subject: [Linux-cluster] Update cluster.conf in Fedora 10 In-Reply-To: <49AEA13A.9050103@redhat.com> References: <19264.57394.qm@web110202.mail.gq1.yahoo.com> <49AEA13A.9050103@redhat.com> Message-ID: <49AFF6CD.7090000@redhat.com> Chrissie Caulfield wrote: > Doug Bunger wrote: >> I'm having trouble making the cluster aware of changes in Fedora 10 >> (x86_64). The setup has three VMs accessing a shared, attached >> partition, formatted as GFS. When modifying the cluster.conf and >> incrementing version number, I have to boot the nodes. I've found some >> online resources that say use cman_tool and others that say use >> ccs_tool. What is the correct commanf for Fedora 10? >> >> # cman_tool nodes >> Node Sts Inc Joined Name >> 2 M 172 2009-03-04 08:39:52 gfs2 >> 3 M 172 2009-03-04 08:39:52 gfs3 >> 4 M 160 2009-03-04 08:39:52 gfs4 >> # head -2 /etc/cluster/cluster.conf >> >> >> # cman_tool version >> 6.1.0 config 9 >> # vi /etc/cluster/cluster.conf >> # head -2 /etc/cluster/cluster.conf >> >> >> # cman_tool version -r 10 >> # cman_tool version >> 6.1.0 config 9 >> # ccs_tool update /etc/cluster/cluster.conf >> Unknown command, update >> Try 'ccs_tool help' for help >> > > On Fedora10 you only need to issue a cman_tool -r command to update > the configuration file version. ccs_tool was for updating ccsd, which is > no longer used. > > That doesn't explain why your version hasn't updated, of course. It > might be worth checking if there are any updated packages for fedora as > it seems to work correctly on my test systems. > I've just tried this with the Fedora packages and had the same problem - so it seems like they are broken :-( The code upstream *does* work though. I don't know how long it will take those changes to reach the fedora10 packages, sorry. -- Chrissie From marcos.david at efacec.pt Thu Mar 5 16:42:20 2009 From: marcos.david at efacec.pt (Marcos David) Date: Thu, 05 Mar 2009 16:42:20 +0000 Subject: [Linux-cluster] Adding a new fence agent Message-ID: <49B000EC.9010407@efacec.pt> Hi, I modified the bladecenter fence agent script to work with an HP c7000 enclosure and i named it fence_c7000 and placed it in the/sbin dir. In order to use this as a fencing device I created the following entries in my cluster.conf: ** ** I then added a new fence level and this fence device to each node. Is this enough to used the new fence_c7000 script? Or is there another configuration to be done? Once I have a stable version, where can I upload it so it can be added to the cluster packages? Thanks. Marcos David -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeff.sturm at eprize.com Fri Mar 6 01:28:41 2009 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Thu, 5 Mar 2009 20:28:41 -0500 Subject: [Linux-cluster] Strange directory listing Message-ID: <64D0546C5EBBD147B75DE133D798665F021BA36A@hugo.eprize.local> We keep Lucene search indexes on a GFS storage volume, mounted cluster-wide. This way each cluster node can perform a search, or append the search index with new content. Works great. 
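For context, a cluster-wide GFS mount of this sort is usually just an ordinary fstab entry picked up by the gfs init script on each node; the device path and mount point below are made up, not the real ones:

   /dev/clustervg/lucene   /search/index   gfs   defaults,noatime   0 0

(noatime is worth having here, since access-time updates on GFS can otherwise turn plain reads into exclusive-lock operations.)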
Funny thing is, when I list the directory containing the search index, I sometimes see output like the following: $ ls -l total 600 ?--------- ? ? ? ? ? _2931.f3 ?--------- ? ? ? ? ? _2931.f4 -rw-r--r-- 1 toybox toybox 1260 Mar 5 20:21 _2931.f5 ?--------- ? ? ? ? ? _2931.f6 -rw-r--r-- 1 toybox toybox 1261 Mar 5 20:21 _2933.f1 -rw-r--r-- 1 toybox toybox 1261 Mar 5 20:21 _2933.f2 -rw-r--r-- 1 toybox toybox 1261 Mar 5 20:21 _2933.f3 -rw-r--r-- 1 toybox toybox 1261 Mar 5 20:21 _2933.f4 -rw-r--r-- 1 toybox toybox 1261 Mar 5 20:21 _2933.f5 -rw-r--r-- 1 toybox toybox 1261 Mar 5 20:21 _2933.f6 -rw-r--r-- 1 toybox toybox 226300 Mar 5 20:21 _2933.fdt -rw-r--r-- 1 toybox toybox 10088 Mar 5 20:21 _2933.fdx -rw-r--r-- 1 toybox toybox 51 Mar 5 20:21 _2933.fnm -rw-r--r-- 1 toybox toybox 60778 Mar 5 20:21 _2933.frq -rw-r--r-- 1 toybox toybox 43035 Mar 5 20:21 _2933.prx -rw-r--r-- 1 toybox toybox 1571 Mar 5 20:21 _2933.tii -rw-r--r-- 1 toybox toybox 118784 Mar 5 20:21 _2933.tis -rw-r--r-- 1 toybox toybox 0 Mar 5 20:21 commit.lock -rw-r--r-- 1 toybox toybox 0 Mar 5 20:21 deletable -rw-r--r-- 1 toybox toybox 18 Mar 5 20:21 segments -rw-r--r-- 1 toybox toybox 0 Mar 5 20:21 write.lock I would have expected to see something in each of the fields. Is this a clue that stat() is failing? If so, why? I've never seen anything like it on a non-GFS volume. -Jeff -------------- next part -------------- An HTML attachment was scrubbed... URL: From mihirjha.hbti at gmail.com Fri Mar 6 05:36:17 2009 From: mihirjha.hbti at gmail.com (mihir jha) Date: Fri, 6 Mar 2009 11:06:17 +0530 Subject: [Linux-cluster] Strange directory listing In-Reply-To: <64D0546C5EBBD147B75DE133D798665F021BA36A@hugo.eprize.local> References: <64D0546C5EBBD147B75DE133D798665F021BA36A@hugo.eprize.local> Message-ID: <983d4bc00903052136w5be868e4vcd95272256cccd5f@mail.gmail.com> Hi All, Please let me know , from where to get libnet.so to compile heartbeat. Thanks and best regards, Mihir jha Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From spods at iinet.net.au Fri Mar 6 07:17:31 2009 From: spods at iinet.net.au (Stewart Walters) Date: Fri, 06 Mar 2009 16:17:31 +0900 Subject: [Linux-cluster] Strange directory listing Message-ID: <31113.1236323851@iinet.net.au> What does a `ls -ln' reveal? Are the UIDs/GIDs of the first three files the same as all the others? Regards, Stewart On Fri Mar 6 9:28 , Jeff Sturm sent: > > > > > >We keep Lucene >search indexes on a GFS storage volume, mounted cluster-wide.?? This way >each cluster node can perform a search, or append the search index with new >content.?? Works great. >?? >Funny thing is, >when I list the directory containing the search index, I sometimes see output >like the following: >?? >$ ls -l >total >600 >?--------- ? ??????????? >????????????????????? >??????????????????????? ? >_2931.f3 >?--------- ? ??????????? >????????????????????? >??????????????????????? ? >_2931.f4 >-rw-r--r-- 1 toybox toybox???? 1260 Mar?? 5 20:21 >_2931.f5 >?--------- ? ??????????? >????????????????????? >??????????????????????? ? >_2931.f6 >-rw-r--r-- 1 toybox toybox???? 1261 Mar?? 5 20:21 >_2933.f1 >-rw-r--r-- 1 toybox toybox???? 1261 Mar?? 5 20:21 >_2933.f2 >-rw-r--r-- 1 toybox toybox???? 1261 Mar?? 5 20:21 >_2933.f3 >-rw-r--r-- 1 toybox toybox???? 1261 Mar?? 5 20:21 >_2933.f4 >-rw-r--r-- 1 toybox toybox???? 1261 Mar?? 5 20:21 >_2933.f5 >-rw-r--r-- 1 toybox toybox???? 1261 Mar?? 
5 20:21 >_2933.f6 >-rw-r--r-- 1 toybox toybox 226300 Mar?? 5 20:21 >_2933.fdt >-rw-r--r-- 1 toybox toybox?? 10088 Mar?? 5 20:21 >_2933.fdx >-rw-r--r-- 1 toybox toybox???????? 51 Mar?? 5 >20:21 _2933.fnm >-rw-r--r-- 1 toybox toybox?? 60778 Mar?? 5 20:21 >_2933.frq >-rw-r--r-- 1 toybox toybox?? 43035 Mar?? 5 20:21 >_2933.prx >-rw-r--r-- 1 toybox toybox???? 1571 Mar?? 5 20:21 >_2933.tii >-rw-r--r-- 1 toybox toybox 118784 Mar?? 5 20:21 >_2933.tis >-rw-r--r-- 1 toybox toybox?????????? 0 >Mar?? 5 20:21 commit.lock >-rw-r--r-- 1 toybox >toybox?????????? 0 Mar?? 5 20:21 deletable >-rw-r--r-- >1 toybox toybox???????? 18 Mar?? 5 20:21 >segments >-rw-r--r-- 1 toybox toybox?????????? 0 Mar?? >5 20:21 write.lock > >I would have >expected to see something in each of the fields.?? Is this??a clue that >stat() is failing??? If so, why??? I've never seen anything like it on a >non-GFS volume. >?? >-Jeff >?? From hyeyoung.cho at gmail.com Fri Mar 6 08:00:12 2009 From: hyeyoung.cho at gmail.com (Hyeyoung Cho) Date: Fri, 6 Mar 2009 17:00:12 +0900 Subject: [Linux-cluster] Multiple block devices of different gnbd servers can be one GFS? How? Message-ID: <9aeb56d10903060000t5db961bbke078eb9ac5b5dd92@mail.gmail.com> ** *Hi all.* *I have a question regarding GFS.* * * *Is it possible that the block devices on multiple gnbd servers were made to one GFS ? * * * *Of course, I already knew that GFS supports multiple gnbd servers for one block device to provide High Availability * *and GNBD clients can import multiple gnbd device. * *Ex)* * gnbd_export -d /dev/sdc2 -e gamma **?**U* * **gnbd_export -d /dev/sdb1 -e delta **?**c* * * *However, multiple hardwares(block devices) of each different nodes(multiple gnbd servers) can be imported as one volume storage of GFS? * *It was supported in GFS?* * * *If it would be supported how can I use it? * *( I should make a Logical Volume at every gnbd clients and make GFS on it ??? )* ** * * *What does play the part in GFS modules? * * * *Thanks in advance* *Hyeyoung Cho. * -------------- next part -------------- An HTML attachment was scrubbed... URL: From fdinitto at redhat.com Fri Mar 6 11:17:59 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Fri, 06 Mar 2009 12:17:59 +0100 Subject: [Linux-cluster] Cluster 3.0.0.alpha7 release Message-ID: <1236338279.26957.35.camel@cerberus.int.fabbione.net> The cluster team and its community are proud to announce the 3.0.0.alpha7 release from the STABLE3 branch. The development cycle for 3.0.0 is about to end. The STABLE3 branch is now collecting only bug fixes and minimal update required to build on top of the latest upstream kernel/corosync/openais, we are getting closer and closer to a shiny new stable release. Everybody with test equipment and time to spare, is highly encouraged to download, install and test the 3.0.0.alpha releases and more important report problems. This is the time for people to make a difference and help us testing as much as possible. 
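Getting the source onto a test box is the easy part. Something like the following does it, assuming wget is available and that the tree follows the usual configure/make convention (the README shipped in the tarball has the authoritative build steps; the URL is the same one listed below):

   wget https://fedorahosted.org/releases/c/l/cluster/cluster-3.0.0.alpha7.tar.gz
   tar xzf cluster-3.0.0.alpha7.tar.gz
   cd cluster-3.0.0.alpha7
   ./configure && make                # exact options: see the README in the tree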
In order to build the 3.0.0.alpha7 release you will need: - corosync 0.94 (strongly recommended to use svn rev 1791 and not higher) - openais 0.93 (strongly recommended to use svn rev 1740 and not higher) - linux kernel 2.6.28.x (requires the latest release from the 2.6.28.x stable release) The new source tarball can be downloaded here: ftp://sources.redhat.com/pub/cluster/releases/cluster-3.0.0.alpha7.tar.gz https://fedorahosted.org/releases/c/l/cluster/cluster-3.0.0.alpha7.tar.gz At the same location is now possible to find separated tarballs for fence-agents and resource-agents as previously announced (http://www.redhat.com/archives/cluster-devel/2009-February/msg00003.html) To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Happy clustering, Fabio Under the hood (from 3.0.0.alpha6): Abhijith Das (1): gfs-kernel: change __grab_cache_page to grab_cache_page_write_begin Christine Caulfield (2): cman: Send a new transition message when the config version is updated cman: Don't give up on the node config if getnameinfo fails David Teigland (3): fenced/dlm_controld/gfs_controld: don't exit from query thread fence_scsi: remove scsi_reserve and scsi_reserve_notify fence_tool: ls EXIT_FAILURE Fabio M. Di Nitto (2): build: fix typo cman: enable timestamp on logging by default Jan Friesse (6): fence_cisco: Added fence agent for Cisco MDS 9124/9134 fence_*.py: Fix no fencing.py based scripts to force stdout close fence_cisco_mds: Fix port handling fence_mcdata: Fix unexpected port state fence_ifmib: Brand new implementation of this agent fence_rps10: Removed RPS-10 agent Lon Hohberger (3): config: Move cluster.rng to better location rgmanager: Allow restart counters to work with central_processing rgmanager: Unbreak failover Ryan O'Hara (1): Add support for unfence operation. 
cman/daemon/cman-preconfig.c | 36 +- cman/daemon/commands.c | 3 + config/plugins/xml/cluster.rng | 2376 ----------------------------- config/tools/xml/cluster.rng | 2376 +++++++++++++++++++++++++++++ fence/agents/apc_snmp/fence_apc_snmp.py | 10 + fence/agents/cisco_mds/Makefile | 5 + fence/agents/cisco_mds/fence_cisco_mds.py | 114 ++ fence/agents/ifmib/README | 10 +- fence/agents/ifmib/fence_ifmib.py | 338 ++--- fence/agents/lib/Makefile | 2 +- fence/agents/lib/fencing.py.py | 60 + fence/agents/lib/fencing_snmp.py.py | 109 ++ fence/agents/mcdata/fence_mcdata.pl | 4 +- fence/agents/rps10/Makefile | 26 - fence/agents/rps10/rps10.c | 521 ------- fence/agents/rsa/fence_rsa.py | 11 + fence/agents/rsb/fence_rsb.py | 11 + fence/agents/scsi/Makefile | 14 +- fence/agents/scsi/fence_scsi.pl | 390 ++--- fence/agents/scsi/scsi_reserve.in | 337 ---- fence/agents/scsi/scsi_reserve_notify.in | 5 - fence/fence_node/Makefile | 2 +- fence/fence_tool/fence_tool.c | 2 +- fence/fenced/main.c | 4 +- fence/man/Makefile | 1 + fence/man/fence_cisco_mds.8 | 132 ++ fence/man/fence_ifmib.8 | 123 ++- gfs-kernel/src/gfs/ops_address.c | 2 +- group/dlm_controld/main.c | 4 +- group/gfs_controld/main.c | 4 +- rgmanager/ChangeLog | 5 + rgmanager/include/resgroup.h | 1 + rgmanager/include/restart_counter.h | 1 + rgmanager/src/daemons/groups.c | 25 +- rgmanager/src/daemons/restart_counter.c | 22 +- rgmanager/src/daemons/rg_event.c | 44 +- rgmanager/src/daemons/rg_state.c | 24 +- rgmanager/src/daemons/slang_event.c | 15 + 38 files changed, 3320 insertions(+), 3849 deletions(-) -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From fdinitto at redhat.com Fri Mar 6 11:20:45 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Fri, 06 Mar 2009 12:20:45 +0100 Subject: [Linux-cluster] Re: [Cluster-devel] Cluster 3.0.0.alpha7 release In-Reply-To: <1236338279.26957.35.camel@cerberus.int.fabbione.net> References: <1236338279.26957.35.camel@cerberus.int.fabbione.net> Message-ID: <1236338445.26957.37.camel@cerberus.int.fabbione.net> On Fri, 2009-03-06 at 12:17 +0100, Fabio M. Di Nitto wrote: > > - corosync 0.94 (strongly recommended to use svn rev 1791 and not > higher) Sorry for the typo.. svn rev. 1792 should be used. Fabio -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From fdinitto at redhat.com Fri Mar 6 11:23:08 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Fri, 06 Mar 2009 12:23:08 +0100 Subject: [Linux-cluster] Update cluster.conf in Fedora 10 In-Reply-To: <19264.57394.qm@web110202.mail.gq1.yahoo.com> References: <19264.57394.qm@web110202.mail.gq1.yahoo.com> Message-ID: <1236338588.26957.41.camel@cerberus.int.fabbione.net> On Wed, 2009-03-04 at 07:34 -0800, Doug Bunger wrote: > I'm having trouble making the cluster aware of changes in Fedora 10 > (x86_64). The setup has three VMs accessing a shared, attached > partition, formatted as GFS. When modifying the cluster.conf and > incrementing version number, I have to boot the nodes. I've found > some online resources that say use cman_tool and others that say use > ccs_tool. What is the correct commanf for Fedora 10? > > Packages in Fedora 10 needs to be updated. I have already started the process, but it might take sometime. 
The way Christine described is the correct approach to update cluster.conf and reload the configuration. Fabio From fdinitto at redhat.com Fri Mar 6 11:25:23 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Fri, 06 Mar 2009 12:25:23 +0100 Subject: [Linux-cluster] Adding a new fence agent In-Reply-To: <49B000EC.9010407@efacec.pt> References: <49B000EC.9010407@efacec.pt> Message-ID: <1236338723.26957.44.camel@cerberus.int.fabbione.net> Hi Marcos, On Thu, 2009-03-05 at 16:42 +0000, Marcos David wrote: > > > Once I have a stable version, where can I upload it so it can be added > to the cluster packages? > Can you please mail it to Jan and Marek? They are the maintainers of all fence agents in our stack. New agents are always welcome! Thanks Fabio From gianluca.cecchi at gmail.com Fri Mar 6 11:30:10 2009 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Fri, 6 Mar 2009 12:30:10 +0100 Subject: [Linux-cluster] Adding a new fence agent Message-ID: <561c252c0903060330t4dc0bc80nc5d3ee511ac4a745@mail.gmail.com> I have a c7000 too with two test blades I'm going to install. I'm available to test it if you like. My planned OS will be RedHat EL 5 U3 x86_64 with its clustersuite Blades will be 2 x BL685c G1 serving Oracle 10gR2 At this moment the fw version of the c7000 is 2.25, while iLo fw is 1.60 One question: I see that on standard blade fence agent (that is for IBM enclosures) you can issue a command such as "power off blade number 3". I imagine the same on this c7000 fence agent, correct? In this case can we say that commands on c7000 (onboard admin I presume) bypasses the iLO of the blades or not? So are they effectively an alternative to iLO commands if for example the iLo is broken, or not? One other thing. Probably you reported the cluster.conf xml layout only as an example, because at this moment the c7000 fence agent would try to do its work only if the iLO agent fails, so you have to invert their order inside the sections bye, Gianluca From marcos.david at efacec.pt Fri Mar 6 11:47:59 2009 From: marcos.david at efacec.pt (Marcos David) Date: Fri, 06 Mar 2009 11:47:59 +0000 Subject: [Linux-cluster] Adding a new fence agent In-Reply-To: <561c252c0903060330t4dc0bc80nc5d3ee511ac4a745@mail.gmail.com> References: <561c252c0903060330t4dc0bc80nc5d3ee511ac4a745@mail.gmail.com> Message-ID: <49B10D6F.7010704@efacec.pt> Hi, Yes you can issue a "poweroff server 3" to gracefully shutdown a blade (I'm using "poweroff server # force", to ensure a fast shutdown, without the "force" option sometimes the OS shutdown hanged and things didn't work properly. I'm not sure how it works internally, I'll have to shutdown or disable the iLO port manually to test it since it is internal to the blade enclosure. I'll be able to do more tests and clean up the script next week. I modified the script to ensure I have another fence level besides the iLO port. During failover tests we removed one of the blades and the cluster went "kaput" because it couldn't access the iLO port of the blade and fence it. With another fence level, if it fails to reach the iLO port, it will try to fence the node through the enclosure and the cluster recovers normally. In this perspective it works as a redundancy if the iLO port fails or there is no communication with the blade. Marcos David Gianluca Cecchi wrote: > I have a c7000 too with two test blades I'm going to install. > I'm available to test it if you like. 
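For readers trying to picture the two-level setup being described, it typically ends up looking something like the sketch below in cluster.conf. The attribute names for fence_c7000 are guessed from the bladecenter-style agent it was derived from (the per-blade attribute might be port= or blade= depending on how the script parses its arguments), and all addresses, names and blade numbers are made up:

   <clusternode name="blade1" nodeid="1" votes="1">
     <fence>
       <method name="1">
         <device name="ilo-blade1"/>
       </method>
       <method name="2">
         <device name="c7000-oa" port="1"/>
       </method>
     </fence>
   </clusternode>
   <!-- same pattern for the other blades -->
   <fencedevices>
     <fencedevice agent="fence_ilo" name="ilo-blade1" hostname="10.0.0.21" login="Administrator" passwd="secret"/>
     <fencedevice agent="fence_c7000" name="c7000-oa" ipaddr="10.0.0.5" login="Administrator" passwd="secret"/>
   </fencedevices>

The order of the <method> blocks controls which level fenced tries first; the second method is only attempted if the first one fails.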
> My planned OS will be RedHat EL 5 U3 x86_64 with its clustersuite > Blades will be 2 x BL685c G1 serving Oracle 10gR2 > At this moment the fw version of the c7000 is 2.25, while iLo fw is 1.60 > > One question: I see that on standard blade fence agent (that is for > IBM enclosures) you can issue a command such as "power off blade > number 3". > I imagine the same on this c7000 fence agent, correct? > In this case can we say that commands on c7000 (onboard admin I > presume) bypasses the iLO of the blades or not? So are they > effectively an alternative to iLO commands if for example the iLo is > broken, or not? > > One other thing. Probably you reported the cluster.conf xml layout > only as an example, because at this moment the c7000 fence agent would > try to do its work only if the iLO agent fails, so you have to invert > their order inside the sections > > bye, > Gianluca > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > From marcos.david at efacec.pt Fri Mar 6 11:48:39 2009 From: marcos.david at efacec.pt (Marcos David) Date: Fri, 06 Mar 2009 11:48:39 +0000 Subject: [Linux-cluster] Adding a new fence agent In-Reply-To: <1236338723.26957.44.camel@cerberus.int.fabbione.net> References: <49B000EC.9010407@efacec.pt> <1236338723.26957.44.camel@cerberus.int.fabbione.net> Message-ID: <49B10D97.8030406@efacec.pt> Ok, I'll mail it as soon as I clean up the code and do some more tests Marcos David Fabio M. Di Nitto wrote: > Hi Marcos, > > On Thu, 2009-03-05 at 16:42 +0000, Marcos David wrote: > >> Once I have a stable version, where can I upload it so it can be added >> to the cluster packages? >> >> > > Can you please mail it to Jan and Marek? They are the maintainers of all > fence agents in our stack. > > New agents are always welcome! > > Thanks > Fabio > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpeterso at redhat.com Fri Mar 6 13:47:03 2009 From: rpeterso at redhat.com (Bob Peterson) Date: Fri, 6 Mar 2009 08:47:03 -0500 (EST) Subject: [Linux-cluster] Strange directory listing In-Reply-To: <1627107821.214381236347052724.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <1180593553.214441236347223937.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- "Jeff Sturm" wrote: | We keep Lucene search indexes on a GFS storage volume, mounted | cluster-wide. This way each cluster node can perform a search, or | append the search index with new content. Works great. | | Funny thing is, when I list the directory containing the search index, | I sometimes see output like the following: | | $ ls -l | total 600 | ?--------- ? ? ? ? ? _2931.f3 What level of GFS driver is this? Are you up2date or running a recent level? Because we had problems like this a long time ago, but they've since been fixed. For example, bugzilla bug #222299, circa 2007. The thing to do is "stat _2931.f3" on all nodes in the cluster and see if the "Inode" value is the same on all nodes. 
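A quick way to run that comparison from one spot, with node names and the mount point as placeholders:

   for h in node1 node2 node3; do
       echo "== $h =="
       ssh $h "stat -c '%i %s %y %n' /mnt/gfs/index/_2931.f3"
   done

If the inode numbers come back different on different nodes, that's the same sort of mismatch the old bugzilla above was about.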
Regards, Bob Peterson Red Hat GFS From jeff.sturm at eprize.com Fri Mar 6 17:49:07 2009 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Fri, 6 Mar 2009 12:49:07 -0500 Subject: [Linux-cluster] Strange directory listing In-Reply-To: <1180593553.214441236347223937.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> References: <1627107821.214381236347052724.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> <1180593553.214441236347223937.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <64D0546C5EBBD147B75DE133D798665F021BA397@hugo.eprize.local> > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Bob Peterson > Sent: Friday, March 06, 2009 8:47 AM > To: linux clustering > Subject: Re: [Linux-cluster] Strange directory listing > > ----- "Jeff Sturm" wrote: > | We keep Lucene search indexes on a GFS storage volume, mounted > | cluster-wide. This way each cluster node can perform a search, or > | append the search index with new content. Works great. > | > | Funny thing is, when I list the directory containing the > search index, > | I sometimes see output like the following: > | > | $ ls -l > | total 600 > | ?--------- ? ? ? ? ? _2931.f3 > > What level of GFS driver is this? Are you up2date or running > a recent level? We aren't running Red Hat. We have: CentOS release 5.2 (Final) kmod-gfs-0.1.23-5.el5 > The thing to do is "stat > _2931.f3" on all nodes in the cluster and see if the "Inode" > value is the same on all nodes. The rogue files don't stay around long enough to stat() on all nodes. My assumption is that these are files just created or in the process of being destroyed, and I don't see them in two successive "ls -l" commands. I wasn't too concerned since this doesn't seem to have any negative impact on the application, but was curious nonetheless. Thanks, Jeff From rpeterso at redhat.com Fri Mar 6 17:59:34 2009 From: rpeterso at redhat.com (Bob Peterson) Date: Fri, 6 Mar 2009 12:59:34 -0500 (EST) Subject: [Linux-cluster] Strange directory listing In-Reply-To: <1635277376.246311236362110482.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <732180264.246891236362374011.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- "Jeff Sturm" wrote: | > What level of GFS driver is this? Are you up2date or running | > a recent level? | | We aren't running Red Hat. We have: | | CentOS release 5.2 (Final) | kmod-gfs-0.1.23-5.el5 Close enough. :7) That's a bit old, so it's possible it's the problem I pointed out. Sounds like it's not hurting you anyway. So hopefully this problem will go away the next time you update, to Centos5.3 or whatever. | The rogue files don't stay around long enough to stat() on all nodes. | My assumption is that these are files just created or in the process | of | being destroyed, and I don't see them in two successive "ls -l" | commands. | | I wasn't too concerned since this doesn't seem to have any negative | impact on the application, but was curious nonetheless. Regards, Bob Peterson From John.Michealson at metc.state.mn.us Fri Mar 6 23:18:14 2009 From: John.Michealson at metc.state.mn.us (Michealson, John) Date: Fri, 6 Mar 2009 17:18:14 -0600 Subject: [linux-cluster] mdadm device as qdisk Message-ID: Hello. I have a HP msa500 SCSI array and I was using a mdadm device as a quorum on a two node cluster until last week when I updated from 5.2 rhel to 5.3 - what I found was the quorum will no longer function seemingly due to the multipath. 
If I create a new quorum using a single physical path from the set that was md, mkqdisk -L reports back the correct label on both after a very long delay. Thoughts? -------------- next part -------------- An HTML attachment was scrubbed... URL: From vishal.bordia at gmail.com Sat Mar 7 04:20:55 2009 From: vishal.bordia at gmail.com (vishal bordia) Date: Fri, 6 Mar 2009 20:20:55 -0800 Subject: [Linux-cluster] Quorum Concept Message-ID: Dear, let me know how can i add a storge quorum in an existing cluster. i have an existing cluster of RHEL4.5 Servers -- Regards, Vishal Bordia HCL Infosystems Ltd. Mob :+91-9216883922 -------------- next part -------------- An HTML attachment was scrubbed... URL: From finnzi at finnzi.com Sat Mar 7 09:57:34 2009 From: finnzi at finnzi.com (=?ISO-8859-1?Q?Finnur_=D6rn_Gu=F0mundsson?=) Date: Sat, 07 Mar 2009 09:57:34 +0000 Subject: [Linux-cluster] Quorum Concept In-Reply-To: References: Message-ID: <49B2450E.70509@finnzi.com> On 3/7/09 4:20 AM, vishal bordia wrote: > > Dear, > let me know how can i add a storge quorum in an existing cluster. > i have an existing cluster of RHEL4.5 Servers > -- > Regards, > Vishal Bordia > HCL Infosystems Ltd. > Mob :+91-9216883922 > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Hi, You might want to read up on the Red Hat Cluster Suite manuals at redhat.com, but here is a nice article that explains the usage of the quorum disk pretty nicely - http://magazine.redhat.com/2007/12/19/enhancing-cluster-quorum-with-qdisk/ Bgrds, Finnzi -------------- next part -------------- An HTML attachment was scrubbed... URL: From John.Michealson at metc.state.mn.us Sat Mar 7 17:29:20 2009 From: John.Michealson at metc.state.mn.us (Michealson, John) Date: Sat, 7 Mar 2009 11:29:20 -0600 Subject: [Linux-cluster] Quorum Concept In-Reply-To: <49B2450E.70509@finnzi.com> References: <49B2450E.70509@finnzi.com> Message-ID: You have to do it manually thru the cluster.conf. The gui will not allow modifying the quorum settings except during cluster creation. From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Finnur ?rn Gu?mundsson Sent: Saturday, March 07, 2009 3:58 AM To: linux clustering Subject: Re: [Linux-cluster] Quorum Concept On 3/7/09 4:20 AM, vishal bordia wrote: Dear, let me know how can i add a storge quorum in an existing cluster. i have an existing cluster of RHEL4.5 Servers -- Regards, Vishal Bordia HCL Infosystems Ltd. Mob :+91-9216883922 ________________________________ -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster Hi, You might want to read up on the Red Hat Cluster Suite manuals at redhat.com, but here is a nice article that explains the usage of the quorum disk pretty nicely - http://magazine.redhat.com/2007/12/19/enhancing-cluster-quorum-with-qdisk/ Bgrds, Finnzi -------------- next part -------------- An HTML attachment was scrubbed... URL: From ramiblanco at gmail.com Mon Mar 9 05:32:50 2009 From: ramiblanco at gmail.com (Ramiro Blanco) Date: Mon, 9 Mar 2009 03:32:50 -0200 Subject: [Linux-cluster] can't re-join cluster after upgrade Message-ID: <713aecdf0903082232w6c897e31s677084bd51c40d93@mail.gmail.com> Hi, I've just upgraded 1 of my 2-node cluster to RHEL 5.3 and now that node can't join the cluster. Can i upgrade 1 node at a time? 
here's the output of /var/log/messages: ... Mar 9 03:26:34 web1 ccsd[29129]: Starting ccsd 2.0.98: Mar 9 03:26:34 web1 ccsd[29129]: Built: Dec 3 2008 16:32:30 Mar 9 03:26:34 web1 ccsd[29129]: Copyright (C) Red Hat, Inc. 2004 All rights reserved. Mar 9 03:26:34 web1 ccsd[29129]: cluster.conf (cluster name = cluster_web, version = 3) found. Mar 9 03:26:34 web1 ccsd[29129]: Remote copy of cluster.conf is from quorate node. Mar 9 03:26:34 web1 ccsd[29129]: Local version # : 3 Mar 9 03:26:34 web1 ccsd[29129]: Remote version #: 3 Mar 9 03:26:34 web1 ccsd[29129]: Remote copy of cluster.conf is from quorate node. Mar 9 03:26:34 web1 ccsd[29129]: Local version # : 3 Mar 9 03:26:34 web1 ccsd[29129]: Remote version #: 3 Mar 9 03:26:34 web1 ccsd[29129]: Remote copy of cluster.conf is from quorate node. Mar 9 03:26:34 web1 ccsd[29129]: Local version # : 3 Mar 9 03:26:34 web1 ccsd[29129]: Remote version #: 3 Mar 9 03:26:34 web1 ccsd[29129]: Remote copy of cluster.conf is from quorate node. Mar 9 03:26:34 web1 ccsd[29129]: Local version # : 3 Mar 9 03:26:34 web1 ccsd[29129]: Remote version #: 3 Mar 9 03:26:34 web1 openais[29135]: [MAIN ] AIS Executive Service RELEASE 'subrev 1358 version 0.80.3' Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors. Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Copyright (C) 2006 Red Hat, Inc. Mar 9 03:26:34 web1 openais[29135]: [MAIN ] AIS Executive Service: started and ready to provide service. Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Using default multicast address of 239.192.73.137 Mar 9 03:26:34 web1 openais[29135]: [MAIN ] openais component openais_cpg loaded. Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Registering service handler 'openais cluster closed process group service v1.01' Mar 9 03:26:34 web1 openais[29135]: [MAIN ] openais component openais_cfg loaded. Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Registering service handler 'openais configuration service' Mar 9 03:26:34 web1 openais[29135]: [MAIN ] openais component openais_msg loaded. Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Registering service handler 'openais message service B.01.01' Mar 9 03:26:34 web1 openais[29135]: [MAIN ] openais component openais_lck loaded. Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Registering service handler 'openais distributed locking service B.01.01' Mar 9 03:26:34 web1 openais[29135]: [MAIN ] openais component openais_evt loaded. Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Registering service handler 'openais event service B.01.01' Mar 9 03:26:34 web1 openais[29135]: [MAIN ] openais component openais_ckpt loaded. Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Registering service handler 'openais checkpoint service B.01.01' Mar 9 03:26:34 web1 openais[29135]: [MAIN ] openais component openais_amf loaded. Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Registering service handler 'openais availability management framework B.01.01' Mar 9 03:26:34 web1 openais[29135]: [MAIN ] openais component openais_clm loaded. Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Registering service handler 'openais cluster membership service B.01.01' Mar 9 03:26:34 web1 openais[29135]: [MAIN ] openais component openais_evs loaded. Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Registering service handler 'openais extended virtual synchrony service' Mar 9 03:26:34 web1 openais[29135]: [MAIN ] openais component openais_cman loaded. 
Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Registering service handler 'openais CMAN membership service 2.01' Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Token Timeout (10000 ms) retransmit timeout (495 ms) Mar 9 03:26:34 web1 openais[29135]: [TOTEM] token hold (386 ms) retransmits before loss (20 retrans) Mar 9 03:26:34 web1 openais[29135]: [TOTEM] join (60 ms) send_join (0 ms) consensus (4800 ms) merge (200 ms) Mar 9 03:26:34 web1 openais[29135]: [TOTEM] downcheck (1000 ms) fail to recv const (50 msgs) Mar 9 03:26:34 web1 openais[29135]: [TOTEM] seqno unchanged const (30 rotations) Maximum network MTU 1500 Mar 9 03:26:34 web1 openais[29135]: [TOTEM] window size per rotation (50 messages) maximum messages per rotation (17 messages) Mar 9 03:26:34 web1 openais[29135]: [TOTEM] send threads (0 threads) Mar 9 03:26:34 web1 openais[29135]: [TOTEM] RRP token expired timeout (495 ms) Mar 9 03:26:34 web1 openais[29135]: [TOTEM] RRP token problem counter (2000 ms) Mar 9 03:26:34 web1 openais[29135]: [TOTEM] RRP threshold (10 problem count) Mar 9 03:26:34 web1 openais[29135]: [TOTEM] RRP mode set to none. Mar 9 03:26:34 web1 openais[29135]: [TOTEM] heartbeat_failures_allowed (0) Mar 9 03:26:34 web1 openais[29135]: [TOTEM] max_network_delay (50 ms) Mar 9 03:26:34 web1 openais[29135]: [TOTEM] HeartBeat is Disabled. To enable set heartbeat_failures_allowed > 0 Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Receive multicast socket recv buffer size (262142 bytes). Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). Mar 9 03:26:34 web1 openais[29135]: [TOTEM] The network interface [192.168.10.3] is now up. Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Created or loaded sequence id 280.192.168.10.3 for this ring. Mar 9 03:26:34 web1 openais[29135]: [TOTEM] entering GATHER state from 15. Mar 9 03:26:34 web1 openais[29135]: [SERV ] Initialising service handler 'openais extended virtual synchrony service' Mar 9 03:26:34 web1 openais[29135]: [SERV ] Initialising service handler 'openais cluster membership service B.01.01' Mar 9 03:26:34 web1 openais[29135]: [SERV ] Initialising service handler 'openais availability management framework B.01.01' Mar 9 03:26:34 web1 openais[29135]: [SERV ] Initialising service handler 'openais checkpoint service B.01.01' Mar 9 03:26:34 web1 openais[29135]: [SERV ] Initialising service handler 'openais event service B.01.01' Mar 9 03:26:34 web1 openais[29135]: [SERV ] Initialising service handler 'openais distributed locking service B.01.01' Mar 9 03:26:34 web1 openais[29135]: [SERV ] Initialising service handler 'openais message service B.01.01' Mar 9 03:26:34 web1 openais[29135]: [SERV ] Initialising service handler 'openais configuration service' Mar 9 03:26:34 web1 openais[29135]: [SERV ] Initialising service handler 'openais cluster closed process group service v1.01' Mar 9 03:26:34 web1 openais[29135]: [SERV ] Initialising service handler 'openais CMAN membership service 2.01' Mar 9 03:26:34 web1 openais[29135]: [CMAN ] CMAN 2.0.98 (built Dec 3 2008 16:32:34) started Mar 9 03:26:34 web1 openais[29135]: [SYNC ] Not using a virtual synchrony filter. Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Creating commit token because I am the rep. Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Saving state aru 0 high seq received 0 Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Storing new sequence id for ring 11c Mar 9 03:26:34 web1 openais[29135]: [TOTEM] entering COMMIT state. 
Mar 9 03:26:34 web1 openais[29135]: [TOTEM] entering RECOVERY state. Mar 9 03:26:34 web1 openais[29135]: [TOTEM] position [0] member 192.168.10.3: Mar 9 03:26:34 web1 openais[29135]: [TOTEM] previous ring seq 280 rep 192.168.10.3 Mar 9 03:26:34 web1 openais[29135]: [TOTEM] aru 0 high delivered 0 received flag 1 Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Did not need to originate any messages in recovery. Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Sending initial ORF token Mar 9 03:26:34 web1 openais[29135]: [CLM ] CLM CONFIGURATION CHANGE Mar 9 03:26:34 web1 openais[29135]: [CLM ] New Configuration: Mar 9 03:26:34 web1 openais[29135]: [CLM ] Members Left: Mar 9 03:26:34 web1 openais[29135]: [CLM ] Members Joined: Mar 9 03:26:34 web1 openais[29135]: [CLM ] CLM CONFIGURATION CHANGE Mar 9 03:26:34 web1 openais[29135]: [CLM ] New Configuration: Mar 9 03:26:34 web1 openais[29135]: [CLM ] r(0) ip(192.168.10.3) Mar 9 03:26:34 web1 openais[29135]: [CLM ] Members Left: Mar 9 03:26:34 web1 openais[29135]: [CLM ] Members Joined: Mar 9 03:26:34 web1 openais[29135]: [CLM ] r(0) ip(192.168.10.3) Mar 9 03:26:34 web1 openais[29135]: [SYNC ] This node is within the primary component and will provide service. Mar 9 03:26:34 web1 openais[29135]: [TOTEM] entering OPERATIONAL state. Mar 9 03:26:34 web1 openais[29135]: [CMAN ] quorum regained, resuming activity Mar 9 03:26:34 web1 openais[29135]: [CLM ] got nodejoin message 192.168.10.3 Mar 9 03:26:34 web1 openais[29135]: [TOTEM] entering GATHER state from 11. Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Creating commit token because I am the rep. Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Saving state aru a high seq received a Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Storing new sequence id for ring 120 Mar 9 03:26:34 web1 openais[29135]: [TOTEM] entering COMMIT state. Mar 9 03:26:34 web1 openais[29135]: [TOTEM] entering RECOVERY state. Mar 9 03:26:34 web1 openais[29135]: [TOTEM] position [0] member 192.168.10.3: Mar 9 03:26:34 web1 openais[29135]: [TOTEM] previous ring seq 284 rep 192.168.10.3 Mar 9 03:26:34 web1 openais[29135]: [TOTEM] aru a high delivered a received flag 1 Mar 9 03:26:34 web1 openais[29135]: [TOTEM] position [1] member 192.168.10.4: Mar 9 03:26:34 web1 openais[29135]: [TOTEM] previous ring seq 284 rep 192.168.10.4 Mar 9 03:26:34 web1 openais[29135]: [TOTEM] aru 8e high delivered 8e received flag 1 Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Did not need to originate any messages in recovery. Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Sending initial ORF token Mar 9 03:26:34 web1 openais[29135]: [CLM ] CLM CONFIGURATION CHANGE Mar 9 03:26:34 web1 openais[29135]: [CLM ] New Configuration: Mar 9 03:26:34 web1 openais[29135]: [CLM ] r(0) ip(192.168.10.3) Mar 9 03:26:34 web1 openais[29135]: [CLM ] Members Left: Mar 9 03:26:34 web1 openais[29135]: [CLM ] Members Joined: Mar 9 03:26:34 web1 openais[29135]: [CLM ] CLM CONFIGURATION CHANGE Mar 9 03:26:34 web1 openais[29135]: [CLM ] New Configuration: Mar 9 03:26:34 web1 openais[29135]: [CLM ] r(0) ip(192.168.10.3) Mar 9 03:26:34 web1 openais[29135]: [CLM ] r(0) ip(192.168.10.4) Mar 9 03:26:34 web1 openais[29135]: [CLM ] Members Left: Mar 9 03:26:34 web1 openais[29135]: [CLM ] Members Joined: Mar 9 03:26:34 web1 openais[29135]: [CLM ] r(0) ip(192.168.10.4) Mar 9 03:26:34 web1 openais[29135]: [SYNC ] This node is within the primary component and will provide service. Mar 9 03:26:34 web1 openais[29135]: [TOTEM] entering OPERATIONAL state. 
Mar 9 03:26:34 web1 openais[29135]: [CLM ] got nodejoin message 192.168.10.3 Mar 9 03:26:34 web1 openais[29135]: [CLM ] got nodejoin message 192.168.10.4 .. Any help would be appreciated. -- Ramiro Blanco From fdinitto at redhat.com Mon Mar 9 10:49:16 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Mon, 09 Mar 2009 11:49:16 +0100 Subject: [Linux-cluster] Cluster 3.0.0.beta1 release Message-ID: <1236595756.17124.7.camel@cerberus.int.fabbione.net> The cluster team and its community are proud to announce the 3.0.0.beta1 release from the STABLE3 branch. The development cycle for 3.0.0 is almost completed. The STABLE3 branch is now collecting only bug fixes and minimal update required to build and run on top of the latest upstream kernel/corosync/openais. Everybody with test equipment and time to spare, is highly encouraged to download, install and test Beta releases and more important report problems. This is the time for people to make a difference and help us testing as much as possible. In order to build the 3.0.0.beta1 release you will need: - corosync 0.94 (strongly recommended to use svn rev 1794 or higher) - openais 0.93 (strongly recommended to use svn rev 1741 or higher) - linux kernel 2.6.28.x (requires the latest release from the 2.6.28.x stable release) The new source tarball can be downloaded here: ftp://sources.redhat.com/pub/cluster/releases/cluster-3.0.0.beta1.tar.gz https://fedorahosted.org/releases/c/l/cluster/cluster-3.0.0.beta1.tar.gz At the same location is now possible to find separated tarballs for fence-agents and resource-agents as previously announced (http://www.redhat.com/archives/cluster-devel/2009-February/msg00003.html) To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Happy clustering, Fabio Under the hood (from 3.0.0.alpha7): Christine Caulfield (2): cman: remove the -f option from cman_tool nodes cman: mark libcman fencing APIs as deprecated. David Teigland (1): man pages: cluster.conf updates Fabio M. Di Nitto (4): init: silence ls output on error notifyd: fix memory leak in environment generation notifyd: don't leak memory if we fail to fork cman_notify libccs: small cleanup Lon Hohberger (3): Ancillary patch to fix another case; bz #327721 rgmanager: Remove sleep in startup of event thread rgmanager: Fix timeouts while trying to locate virtual machines Marek 'marx' Grac (1): fence_egenera: Allow fence_egenera to specify ssh login name cman/cman_tool/cman_tool.h | 1 - cman/cman_tool/main.c | 24 ------------------ cman/daemon/commands.c | 1 + cman/init.d/cman.in | 2 +- cman/lib/libcman.h | 43 ++++++++++++++++----------------- cman/man/cman_tool.8 | 5 ---- cman/notifyd/main.c | 20 ++++++++++++--- config/libs/libccsconfdb/libccs.c | 2 +- config/man/cluster.conf.5 | 19 ++++++++++---- fence/agents/egenera/fence_egenera.pl | 15 ++++++++--- rgmanager/src/daemons/rg_event.c | 13 ++------- rgmanager/src/daemons/rg_state.c | 7 +++-- 12 files changed, 71 insertions(+), 81 deletions(-) -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From rajpurush at gmail.com Mon Mar 9 10:53:44 2009 From: rajpurush at gmail.com (Rajeev P) Date: Mon, 9 Mar 2009 16:23:44 +0530 Subject: [Linux-cluster] hardware watchdog timer Message-ID: <7a271b290903090353s4b3c7282xf5e2981ee04451d0@mail.gmail.com> Hi all, I am trying to setup the software "watchdog" daemon on a Proliant 585 G1 x86_64 clustered host to reset a node incase of hang. The man page says that the watchdog daemon can be used to use either a software WDT or a hardware based WDT. I intend to use the system's hardware based WDT (because of its reliability) The questions I have is : 1. How do I determine the system's hardware WDT present on the host, if present ? 2. How do I configure the watchdog daemon to use the identfied hardware WDT? Regards, Rajeev -------------- next part -------------- An HTML attachment was scrubbed... URL: From schlegel at riege.com Mon Mar 9 11:50:56 2009 From: schlegel at riege.com (Gunther Schlegel) Date: Mon, 09 Mar 2009 12:50:56 +0100 Subject: [Linux-cluster] can't re-join cluster after upgrade In-Reply-To: <713aecdf0903082232w6c897e31s677084bd51c40d93@mail.gmail.com> References: <713aecdf0903082232w6c897e31s677084bd51c40d93@mail.gmail.com> Message-ID: <49B502A0.5090704@riege.com> openais from 5.2 and 5.3 cannot talk top each other. There is a bugzilla ticket on this (but I cannot find the id right now) and several requests to RH support. While RH support still works on it preliminary information indicates that there won't be a fix for this, and I also dooubt that there will be a workaround. Shutting down amd restarting the entire cluster solves the problem. Installing openais from RHEL 5.2 will let the updated node join the cluster as well, if you want it up and can't shutdown node 2 as well. best regards, Gunther Ramiro Blanco wrote: > Hi, I've just upgraded 1 of my 2-node cluster to RHEL 5.3 and now that > node can't join the cluster. Can i upgrade 1 node at a time? > here's the output of /var/log/messages: > > ... > Mar 9 03:26:34 web1 ccsd[29129]: Starting ccsd 2.0.98: > Mar 9 03:26:34 web1 ccsd[29129]: Built: Dec 3 2008 16:32:30 > Mar 9 03:26:34 web1 ccsd[29129]: Copyright (C) Red Hat, Inc. 2004 > All rights reserved. > Mar 9 03:26:34 web1 ccsd[29129]: cluster.conf (cluster name = > cluster_web, version = 3) found. > Mar 9 03:26:34 web1 ccsd[29129]: Remote copy of cluster.conf is from > quorate node. > Mar 9 03:26:34 web1 ccsd[29129]: Local version # : 3 > Mar 9 03:26:34 web1 ccsd[29129]: Remote version #: 3 > Mar 9 03:26:34 web1 ccsd[29129]: Remote copy of cluster.conf is from > quorate node. > Mar 9 03:26:34 web1 ccsd[29129]: Local version # : 3 > Mar 9 03:26:34 web1 ccsd[29129]: Remote version #: 3 > Mar 9 03:26:34 web1 ccsd[29129]: Remote copy of cluster.conf is from > quorate node. > Mar 9 03:26:34 web1 ccsd[29129]: Local version # : 3 > Mar 9 03:26:34 web1 ccsd[29129]: Remote version #: 3 > Mar 9 03:26:34 web1 ccsd[29129]: Remote copy of cluster.conf is from > quorate node. > Mar 9 03:26:34 web1 ccsd[29129]: Local version # : 3 > Mar 9 03:26:34 web1 ccsd[29129]: Remote version #: 3 > Mar 9 03:26:34 web1 openais[29135]: [MAIN ] AIS Executive Service > RELEASE 'subrev 1358 version 0.80.3' > Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Copyright (C) 2002-2006 > MontaVista Software, Inc and contributors. > Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Copyright (C) 2006 Red > Hat, Inc. 
> Mar 9 03:26:34 web1 openais[29135]: [MAIN ] AIS Executive Service: > started and ready to provide service. > Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Using default multicast > address of 239.192.73.137 > Mar 9 03:26:34 web1 openais[29135]: [MAIN ] openais component > openais_cpg loaded. > Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Registering service > handler 'openais cluster closed process group service v1.01' > Mar 9 03:26:34 web1 openais[29135]: [MAIN ] openais component > openais_cfg loaded. > Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Registering service > handler 'openais configuration service' > Mar 9 03:26:34 web1 openais[29135]: [MAIN ] openais component > openais_msg loaded. > Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Registering service > handler 'openais message service B.01.01' > Mar 9 03:26:34 web1 openais[29135]: [MAIN ] openais component > openais_lck loaded. > Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Registering service > handler 'openais distributed locking service B.01.01' > Mar 9 03:26:34 web1 openais[29135]: [MAIN ] openais component > openais_evt loaded. > Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Registering service > handler 'openais event service B.01.01' > Mar 9 03:26:34 web1 openais[29135]: [MAIN ] openais component > openais_ckpt loaded. > Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Registering service > handler 'openais checkpoint service B.01.01' > Mar 9 03:26:34 web1 openais[29135]: [MAIN ] openais component > openais_amf loaded. > Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Registering service > handler 'openais availability management framework B.01.01' > Mar 9 03:26:34 web1 openais[29135]: [MAIN ] openais component > openais_clm loaded. > Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Registering service > handler 'openais cluster membership service B.01.01' > Mar 9 03:26:34 web1 openais[29135]: [MAIN ] openais component > openais_evs loaded. > Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Registering service > handler 'openais extended virtual synchrony service' > Mar 9 03:26:34 web1 openais[29135]: [MAIN ] openais component > openais_cman loaded. > Mar 9 03:26:34 web1 openais[29135]: [MAIN ] Registering service > handler 'openais CMAN membership service 2.01' > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Token Timeout (10000 ms) > retransmit timeout (495 ms) > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] token hold (386 ms) > retransmits before loss (20 retrans) > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] join (60 ms) send_join (0 > ms) consensus (4800 ms) merge (200 ms) > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] downcheck (1000 ms) fail > to recv const (50 msgs) > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] seqno unchanged const (30 > rotations) Maximum network MTU 1500 > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] window size per rotation > (50 messages) maximum messages per rotation (17 messages) > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] send threads (0 threads) > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] RRP token expired timeout > (495 ms) > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] RRP token problem counter > (2000 ms) > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] RRP threshold (10 problem > count) > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] RRP mode set to none. > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] > heartbeat_failures_allowed (0) > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] max_network_delay (50 ms) > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] HeartBeat is Disabled. 
To > enable set heartbeat_failures_allowed > 0 > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Receive multicast socket > recv buffer size (262142 bytes). > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Transmit multicast socket > send buffer size (262142 bytes). > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] The network interface > [192.168.10.3] is now up. > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Created or loaded > sequence id 280.192.168.10.3 for this ring. > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] entering GATHER state > from 15. > Mar 9 03:26:34 web1 openais[29135]: [SERV ] Initialising service > handler 'openais extended virtual synchrony service' > Mar 9 03:26:34 web1 openais[29135]: [SERV ] Initialising service > handler 'openais cluster membership service B.01.01' > Mar 9 03:26:34 web1 openais[29135]: [SERV ] Initialising service > handler 'openais availability management framework B.01.01' > Mar 9 03:26:34 web1 openais[29135]: [SERV ] Initialising service > handler 'openais checkpoint service B.01.01' > Mar 9 03:26:34 web1 openais[29135]: [SERV ] Initialising service > handler 'openais event service B.01.01' > Mar 9 03:26:34 web1 openais[29135]: [SERV ] Initialising service > handler 'openais distributed locking service B.01.01' > Mar 9 03:26:34 web1 openais[29135]: [SERV ] Initialising service > handler 'openais message service B.01.01' > Mar 9 03:26:34 web1 openais[29135]: [SERV ] Initialising service > handler 'openais configuration service' > Mar 9 03:26:34 web1 openais[29135]: [SERV ] Initialising service > handler 'openais cluster closed process group service v1.01' > Mar 9 03:26:34 web1 openais[29135]: [SERV ] Initialising service > handler 'openais CMAN membership service 2.01' > Mar 9 03:26:34 web1 openais[29135]: [CMAN ] CMAN 2.0.98 (built Dec 3 > 2008 16:32:34) started > Mar 9 03:26:34 web1 openais[29135]: [SYNC ] Not using a virtual > synchrony filter. > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Creating commit token > because I am the rep. > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Saving state aru 0 high > seq received 0 > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Storing new sequence id > for ring 11c > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] entering COMMIT state. > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] entering RECOVERY state. > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] position [0] member > 192.168.10.3: > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] previous ring seq 280 rep > 192.168.10.3 > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] aru 0 high delivered 0 > received flag 1 > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Did not need to originate > any messages in recovery. > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Sending initial ORF token > Mar 9 03:26:34 web1 openais[29135]: [CLM ] CLM CONFIGURATION CHANGE > Mar 9 03:26:34 web1 openais[29135]: [CLM ] New Configuration: > Mar 9 03:26:34 web1 openais[29135]: [CLM ] Members Left: > Mar 9 03:26:34 web1 openais[29135]: [CLM ] Members Joined: > Mar 9 03:26:34 web1 openais[29135]: [CLM ] CLM CONFIGURATION CHANGE > Mar 9 03:26:34 web1 openais[29135]: [CLM ] New Configuration: > Mar 9 03:26:34 web1 openais[29135]: [CLM ] r(0) ip(192.168.10.3) > Mar 9 03:26:34 web1 openais[29135]: [CLM ] Members Left: > Mar 9 03:26:34 web1 openais[29135]: [CLM ] Members Joined: > Mar 9 03:26:34 web1 openais[29135]: [CLM ] r(0) ip(192.168.10.3) > Mar 9 03:26:34 web1 openais[29135]: [SYNC ] This node is within the > primary component and will provide service. 
> Mar 9 03:26:34 web1 openais[29135]: [TOTEM] entering OPERATIONAL state. > Mar 9 03:26:34 web1 openais[29135]: [CMAN ] quorum regained, resuming activity > Mar 9 03:26:34 web1 openais[29135]: [CLM ] got nodejoin message 192.168.10.3 > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] entering GATHER state from 11. > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Creating commit token > because I am the rep. > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Saving state aru a high > seq received a > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Storing new sequence id > for ring 120 > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] entering COMMIT state. > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] entering RECOVERY state. > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] position [0] member 192.168.10.3: > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] previous ring seq 284 rep > 192.168.10.3 > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] aru a high delivered a > received flag 1 > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] position [1] member 192.168.10.4: > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] previous ring seq 284 rep > 192.168.10.4 > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] aru 8e high delivered 8e > received flag 1 > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Did not need to originate > any messages in recovery. > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] Sending initial ORF token > Mar 9 03:26:34 web1 openais[29135]: [CLM ] CLM CONFIGURATION CHANGE > Mar 9 03:26:34 web1 openais[29135]: [CLM ] New Configuration: > Mar 9 03:26:34 web1 openais[29135]: [CLM ] r(0) ip(192.168.10.3) > Mar 9 03:26:34 web1 openais[29135]: [CLM ] Members Left: > Mar 9 03:26:34 web1 openais[29135]: [CLM ] Members Joined: > Mar 9 03:26:34 web1 openais[29135]: [CLM ] CLM CONFIGURATION CHANGE > Mar 9 03:26:34 web1 openais[29135]: [CLM ] New Configuration: > Mar 9 03:26:34 web1 openais[29135]: [CLM ] r(0) ip(192.168.10.3) > Mar 9 03:26:34 web1 openais[29135]: [CLM ] r(0) ip(192.168.10.4) > Mar 9 03:26:34 web1 openais[29135]: [CLM ] Members Left: > Mar 9 03:26:34 web1 openais[29135]: [CLM ] Members Joined: > Mar 9 03:26:34 web1 openais[29135]: [CLM ] r(0) ip(192.168.10.4) > Mar 9 03:26:34 web1 openais[29135]: [SYNC ] This node is within the > primary component and will provide service. > Mar 9 03:26:34 web1 openais[29135]: [TOTEM] entering OPERATIONAL state. > Mar 9 03:26:34 web1 openais[29135]: [CLM ] got nodejoin message 192.168.10.3 > Mar 9 03:26:34 web1 openais[29135]: [CLM ] got nodejoin message 192.168.10.4 > .. > > Any help would be appreciated. > > > -- Gunther Schlegel Manager IT Infrastructure ............................................................. Riege Software International GmbH Fon: +49 (2159) 9148 0 Mollsfeld 10 Fax: +49 (2159) 9148 11 40670 Meerbusch Web: www.riege.com Germany E-Mail: schlegel at riege.com --- --- Handelsregister: Managing Directors: Amtsgericht Neuss HRB-NR 4207 Christian Riege USt-ID-Nr.: DE120585842 Gabriele Riege Johannes Riege ............................................................. YOU CARE FOR FREIGHT, WE CARE FOR YOU -------------- next part -------------- A non-text attachment was scrubbed... 
Name: schlegel.vcf Type: text/x-vcard Size: 346 bytes Desc: not available URL: From ghodgens at inpses.co.uk Mon Mar 9 14:54:14 2009 From: ghodgens at inpses.co.uk (Graeme Hodgens) Date: Mon, 9 Mar 2009 14:54:14 -0000 Subject: [Linux-cluster] libmagic.so.1 download Message-ID: <6FA9DE3DC907DA408817927FF524F2C8931B20@inpseswin02.inpses.co.uk> I need to download a copy of the libmagic.so.1 package to complete my cluster configuration. Can anyone point me to a suitable download site; I can't find it on RHN? Thanks Graeme Graeme Hodgens INPS Tel: 01382 561010 Fax: 01382 562816 Direct Dial: 01382 564311 Registered Address: The Bread Factory, 1a Broughton Street, London SW8 3QJ Registered Number: 1788577 Registered in the UK Visit our Web site at www.inps.co.uk The information in this internet email is confidential and is intended solely for the addressee. Access, copying or re-use of information in it by anyone else is not authorised. Any views or opinions presented are solely those of the author and do not necessarily represent those of In Practice Systems Limited or any of its affiliates. If you are not the intended recipient please contact is.helpdesk at inps.co.uk -------------- next part -------------- An HTML attachment was scrubbed... URL: From vu at sivell.com Mon Mar 9 16:36:05 2009 From: vu at sivell.com (vu pham) Date: Mon, 09 Mar 2009 10:36:05 -0600 Subject: [Linux-cluster] libmagic.so.1 download In-Reply-To: <6FA9DE3DC907DA408817927FF524F2C8931B20@inpseswin02.inpses.co.uk> References: <6FA9DE3DC907DA408817927FF524F2C8931B20@inpseswin02.inpses.co.uk> Message-ID: <49B54575.7050503@sivell.com> Graeme Hodgens wrote: > I need to download a copy of the libmagic.so.1 package to complete my > cluster configuration. > > > > Can anyone point me to a suitable download site; I can?t find it on RHN? > [root at vm1 lib]# rpm -qf libmagic.so.1 file-4.17-15 yum install file Vu From ghodgens at inpses.co.uk Mon Mar 9 16:04:11 2009 From: ghodgens at inpses.co.uk (Graeme Hodgens) Date: Mon, 9 Mar 2009 16:04:11 -0000 Subject: [Linux-cluster] libmagic.so.1 download References: <6FA9DE3DC907DA408817927FF524F2C8931B20@inpseswin02.inpses.co.uk> <49B54575.7050503@sivell.com> Message-ID: <6FA9DE3DC907DA408817927FF524F2C8969185@inpseswin02.inpses.co.uk> Thank you for the swift reply but this didn't work unfortunately, still no install of libmagic, any other suggestions? -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of vu pham Sent: 09 March 2009 16:36 To: linux clustering Subject: Re: [Linux-cluster] libmagic.so.1 download Graeme Hodgens wrote: > I need to download a copy of the libmagic.so.1 package to complete my > cluster configuration. > > > > Can anyone point me to a suitable download site; I can't find it on RHN? 
> [root at vm1 lib]# rpm -qf libmagic.so.1 file-4.17-15 yum install file Vu -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From vu at sivell.com Mon Mar 9 18:50:32 2009 From: vu at sivell.com (vu pham) Date: Mon, 09 Mar 2009 12:50:32 -0600 Subject: [Linux-cluster] libmagic.so.1 download In-Reply-To: <6FA9DE3DC907DA408817927FF524F2C8969185@inpseswin02.inpses.co.uk> References: <6FA9DE3DC907DA408817927FF524F2C8931B20@inpseswin02.inpses.co.uk><49B54575.7050503@sivell.com> <6FA9DE3DC907DA408817927FF524F2C8969185@inpseswin02.inpses.co.uk> Message-ID: <49B564F8.5040208@sivell.com> Graeme Hodgens wrote: > Thank you for the swift reply but this didn't work unfortunately, still > no install of libmagic, any other suggestions? No problem. What do you mean "this didn't work" ? The package 'file' cannot be installed ? Or is it installed but you still do not have that lib file ? If it is installed, what is the output of "rpm -ql file" ? Btw, what is your server's version ? Vu > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of vu pham > Sent: 09 March 2009 16:36 > To: linux clustering > Subject: Re: [Linux-cluster] libmagic.so.1 download > > Graeme Hodgens wrote: >> I need to download a copy of the libmagic.so.1 package to complete my >> cluster configuration. >> >> >> >> Can anyone point me to a suitable download site; I can't find it on > RHN? > > [root at vm1 lib]# rpm -qf libmagic.so.1 > file-4.17-15 > > yum install file > > Vu > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Vu Pham Sivell Corporation 7155 Old Katy Rd. Suite 110 South Houston, TX 77024-2136 voice: 713-821-9800 ext 2203 fax: 713-821-9899 From breaktime123 at yahoo.com Mon Mar 9 18:13:46 2009 From: breaktime123 at yahoo.com (break time) Date: Mon, 9 Mar 2009 11:13:46 -0700 (PDT) Subject: [Linux-cluster] libmagic.so.1 download Message-ID: <439831.46190.qm@web111115.mail.gq1.yahoo.com> Dear Vu, ? First, sorry for spam ! ? I saw your post on OSDir, regarding to setup new cluster on RH5.3 ? "An error occured when trying to contact any of the nodes in the Ex1 cluster." ? ? I have the same issues with rh5.3. Do you have any suggestion to solve this problem? ? ? Thanks Minh ? ? ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From vu at sivell.com Mon Mar 9 19:23:51 2009 From: vu at sivell.com (vu pham) Date: Mon, 09 Mar 2009 13:23:51 -0600 Subject: [Linux-cluster] cluster / luci/ RHEL 5.3 In-Reply-To: <439831.46190.qm@web111115.mail.gq1.yahoo.com> References: <439831.46190.qm@web111115.mail.gq1.yahoo.com> Message-ID: <49B56CC7.9050806@sivell.com> break time wrote: > Dear Vu, > > First, sorry for spam ! > > I saw your post on OSDir, regarding to setup new cluster on RH5.3 > > "An error occured when trying to contact any of the nodes in the Ex1 > cluster." > > > I have the same issues with rh5.3. > Do you have any suggestion to solve this problem? > > > Thanks > Minh > Hi Minh, I posted that luci problem to RH support and got confirmed about that. Maybe RH will have a fixed package soon. 
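In the meantime it is worth ruling out the more mundane causes of that message before blaming the package. luci only talks to the ricci agent on each node, so a quick sanity check along these lines can narrow it down. This is only a rough sketch; "node1" stands for whatever your cluster nodes are actually called:

# on every cluster node: is ricci running, and will it start again after a reboot?
service ricci status
chkconfig ricci on

# from the luci host: can we actually reach ricci's default port (11111)?
telnet node1 11111

If ricci is up and reachable from the luci machine and the "unable to contact any of the nodes" error still appears, then it really is the luci problem above and you will have to wait for the updated package.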
Vu From breaktime123 at yahoo.com Mon Mar 9 18:27:25 2009 From: breaktime123 at yahoo.com (break time) Date: Mon, 9 Mar 2009 11:27:25 -0700 (PDT) Subject: [Linux-cluster] cluster / luci/ RHEL 5.3 Message-ID: <183153.40310.qm@web111102.mail.gq1.yahoo.com> ? Thanks ! ? ? ? --- On Mon, 3/9/09, vu pham wrote: From: vu pham Subject: [Linux-cluster] cluster / luci/ RHEL 5.3 To: "linux clustering" Date: Monday, March 9, 2009, 12:23 PM break time wrote: > Dear Vu, >? First, sorry for spam ! >? I saw your post on OSDir, regarding to setup new cluster on RH5.3 >? "An error occured when trying to contact any of the nodes in the Ex1 cluster." >???I have the same issues with rh5.3. > Do you have any suggestion to solve this problem? >???Thanks > Minh >? Hi Minh, I posted that luci problem to RH support and got confirmed about that. Maybe RH will have a fixed package soon. Vu -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From ramiblanco at gmail.com Mon Mar 9 21:40:24 2009 From: ramiblanco at gmail.com (Ramiro Blanco) Date: Mon, 9 Mar 2009 19:40:24 -0200 Subject: [Linux-cluster] can't re-join cluster after upgrade In-Reply-To: <49B502A0.5090704@riege.com> References: <713aecdf0903082232w6c897e31s677084bd51c40d93@mail.gmail.com> <49B502A0.5090704@riege.com> Message-ID: <713aecdf0903091440p5eeb09d9g907f08eb182d15db@mail.gmail.com> 2009/3/9 Gunther Schlegel : > openais from 5.2 and 5.3 cannot talk top each other. There is a bugzilla > ticket on this (but I cannot find the id right now) and several requests to > RH support. > > While RH support still works on it preliminary information indicates that > there won't be a fix for this, and I also dooubt that there will be a > workaround. > > Shutting down amd restarting the entire cluster solves the problem. > > Installing openais from RHEL 5.2 will let the updated node join the cluster > as well, if you want it up and can't shutdown node 2 as well. > > best regards, Gunther > Thank you, you were very clear. I've downgraded openais for now. Whenever i can, i'll upgrade both. Cheers! -- Ramiro Blanco From jeetendra.p at directi.com Tue Mar 10 05:46:23 2009 From: jeetendra.p at directi.com (Jeetendra) Date: Tue, 10 Mar 2009 11:16:23 +0530 Subject: [Linux-cluster] GFS2 vs EXT3 performance issue Message-ID: <001b01c9a143$8bc93570$a35ba050$@p@directi.com> Hi all, I am doing some postmark tests with gfs2 on a 2 node cluster using a iSCSI partition on a SAN device configured in multipath Im using 2 path in multipath and my multipath config is http://pastebin.com/me4facbb My multipath is using 2 Intel NIC (one onboard eth1 and one external eth2) which are conected to decdicated managed switch(HP) Im using iscsi-initiator-utils-6.2.0.870 and my iscsi Node conf. 
is # iscsiadm -m node 172.16.1.252:3260,1 iqn.1994-04.jp.co.xxxxxx:rsd.d7x.t.10126.0a000 172.16.2.252:3260,1 iqn.1994-04.jp.co.xxxxxx:rsd.d7x.t.10126.0b000 The primary NIC (eth0) is used for cluster setup and fencing which is on different nerwork other than multipath interface My kernel is 2.6.27.9 on a 64 bit system with cluster-2.03.10 and openais-0.80.3 installed from source on a centos 5.2 final box I have created GFS patition on the Iscsi disk using mkfs -t gfs2 -p lock_nolock -t alpha:gfs -j 4 /dev/mapper/disk1 And My cluster.conf is at http://pastebin.com/m209d6124 My postmark config is at http://pastebin.com/m76ba067 which creates and reads Huge no of files GFS2 is mounted with options "/gfsmount type gfs2 (rw,noatime,nodiratime,localflocks,localcaching)" Actually there are 25 postmark procs each writing and reading to their own locatioons in the same gfs2 mount from a single machine Im using lock_nolock but i have a feeling that locking is still taking place A lockdump shows zillions of entries like - H: s:SH f:EH e:0 p:4461 [postmark] gfs2_inode_lookup+0x113/0x1f5 [gfs2] and performance is too slow Lockdump result for above 2 test are similar and be seen at http://203.199.107.58/lockdump.txt The Result of GFS with dlm_lock is http://203.199.107.58/Iteration-1-Test2-gfs2-lock_dlm.html The Result of GFS with nolock and demote_secs set to 86400 is http://203.199.107.58/Iteration-1-Test2-gfs2-lock_nolock.html The Result for ext3 is at http://203.199.107.58/Iteration-1-Test2-ext3.html I have verified the glock trimming patch is applied to the code I have read reports saying GFS2 gives 95% of ext3 performance but my setup doesn't seem to get anywhere closer Any suggestions/tips on what i might be doing wrong ? Regards Jeetendra From schlegel at riege.com Tue Mar 10 07:37:40 2009 From: schlegel at riege.com (Gunther Schlegel) Date: Tue, 10 Mar 2009 08:37:40 +0100 Subject: [Linux-cluster] can't re-join cluster after upgrade to 5.3 / openais regression In-Reply-To: <713aecdf0903091440p5eeb09d9g907f08eb182d15db@mail.gmail.com> References: <713aecdf0903082232w6c897e31s677084bd51c40d93@mail.gmail.com> <49B502A0.5090704@riege.com> <713aecdf0903091440p5eeb09d9g907f08eb182d15db@mail.gmail.com> Message-ID: <49B618C4.2080103@riege.com> there seems to be some progress on the issue: https://bugzilla.redhat.com/show_bug.cgi?id=487214 best regards, Gunther Ramiro Blanco wrote: > 2009/3/9 Gunther Schlegel : >> openais from 5.2 and 5.3 cannot talk top each other. There is a bugzilla >> ticket on this (but I cannot find the id right now) and several requests to >> RH support. >> >> While RH support still works on it preliminary information indicates that >> there won't be a fix for this, and I also dooubt that there will be a >> workaround. >> >> Shutting down amd restarting the entire cluster solves the problem. >> >> Installing openais from RHEL 5.2 will let the updated node join the cluster >> as well, if you want it up and can't shutdown node 2 as well. >> >> best regards, Gunther >> > > Thank you, you were very clear. I've downgraded openais for now. > Whenever i can, i'll upgrade both. > Cheers! > > -- Gunther Schlegel Manager IT Infrastructure ............................................................. 
Riege Software International GmbH Fon: +49 (2159) 9148 0 Mollsfeld 10 Fax: +49 (2159) 9148 11 40670 Meerbusch Web: www.riege.com Germany E-Mail: schlegel at riege.com --- --- Handelsregister: Managing Directors: Amtsgericht Neuss HRB-NR 4207 Christian Riege USt-ID-Nr.: DE120585842 Gabriele Riege Johannes Riege ............................................................. YOU CARE FOR FREIGHT, WE CARE FOR YOU -------------- next part -------------- A non-text attachment was scrubbed... Name: schlegel.vcf Type: text/x-vcard Size: 346 bytes Desc: not available URL: From figaro at neo-info.net Tue Mar 10 09:12:49 2009 From: figaro at neo-info.net (Figaro Yang) Date: Tue, 10 Mar 2009 17:12:49 +0800 Subject: [Linux-cluster] some question about gnbd ? Message-ID: <024b01c9a160$62911f20$27b35d60$@net> Hi all ~ I have some question about gnbd ,would you please kindly give me some advice . Now, the market already has many Global file system which can support infiniBand RDMA such as Lustre File , NFS over RDMA ., but does GNBD has the some function? Any help would be appreciated. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mockey.chen at nsn.com Tue Mar 10 09:22:55 2009 From: mockey.chen at nsn.com (Mockey Chen) Date: Tue, 10 Mar 2009 17:22:55 +0800 Subject: [Linux-cluster] service dependence Message-ID: <49B6316F.8000802@nsn.com> Hi, I have some cluster services A, B, C, D. service B should be started after service A start properly. service C should be started after service B. how can I implement it ? Best regards. Mockey From figaro at neo-info.net Tue Mar 10 09:20:57 2009 From: figaro at neo-info.net (Figaro Yang) Date: Tue, 10 Mar 2009 17:20:57 +0800 Subject: [Linux-cluster] can't import gnbd ? Message-ID: <025701c9a161$8575f690$9061e3b0$@net> Hi , ALL .. I set up a File Cluster consist of 3 GNBD Server , using gnbd server for fencing , in case , when I on client node to gnbd_import , system have return some error message .. how can I solve it ? thank all .. [root at client ~]# modprobe gnbd [root at client ~]# gnbd_import -i io3 gnbd_import: ERROR cannot get node name : No such file or directory gnbd_import: ERROR If you are not planning to use a cluster manager, use -n [root at client ~]# tail /var/log/messages Mar 10 04:44:54 client smartd[3719]: Device: /dev/sda, is SMART capable. Adding to "monitor" list. Mar 10 04:44:54 client smartd[3719]: Monitoring 1 ATA and 0 SCSI devices Mar 10 04:44:55 client smartd[3721]: smartd has fork()ed into background mode. New PID=3721. Mar 10 04:44:55 client pcscd: winscard.c:304:SCardConnect() Reader E-Gate 0 0 Not Found Mar 10 04:44:55 client last message repeated 3 times Mar 10 04:44:56 client kernel: mtrr: type mismatch for d8000000,2000000 old: uncachable new: write-combining Mar 10 05:15:15 client gnbd_import: ERROR [../../utils/gnbd_utils.c:78] cman_init failed : No such file or directory Mar 10 05:15:17 client last message repeated 2 times Mar 10 05:15:21 client kernel: gnbd: registered device at major 252 Mar 10 05:15:24 client gnbd_import: ERROR [../../utils/gnbd_utils.c:78] cman_init failed : No such file or directory -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rkprotocol at gmail.com Tue Mar 10 09:27:08 2009 From: rkprotocol at gmail.com (ramesh kasula) Date: Tue, 10 Mar 2009 14:57:08 +0530 Subject: [Linux-cluster] service dependence In-Reply-To: <49B6316F.8000802@nsn.com> References: <49B6316F.8000802@nsn.com> Message-ID: <98fa30820903100227wbba735fk1463b0712ce379d8@mail.gmail.com> plz remove my mail from ur group , iam getting frustration abt these mails Regards Ramesh kasula On Tue, Mar 10, 2009 at 2:52 PM, Mockey Chen wrote: > Hi, > > I have some cluster services A, B, C, D. > service B should be started after service A start properly. > service C should be started after service B. > > how can I implement it ? > > Best regards. > > Mockey > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robejrm at gmail.com Tue Mar 10 09:38:28 2009 From: robejrm at gmail.com (Juan Ramon Martin Blanco) Date: Tue, 10 Mar 2009 10:38:28 +0100 Subject: [Linux-cluster] service dependence In-Reply-To: <98fa30820903100227wbba735fk1463b0712ce379d8@mail.gmail.com> References: <49B6316F.8000802@nsn.com> <98fa30820903100227wbba735fk1463b0712ce379d8@mail.gmail.com> Message-ID: <8a5668960903100238m481249abu3bff99b4a3f00c33@mail.gmail.com> 2009/3/10 ramesh kasula > plz remove my mail from ur group , iam getting frustration abt these mails > > Please, do the removal on your own, as you did subscribe before. Use the url: https://www.redhat.com/mailman/listinfo/linux-cluster and login with your provided password (provided when you subscribed) Greetings, Juanra > Regards > Ramesh kasula > > On Tue, Mar 10, 2009 at 2:52 PM, Mockey Chen wrote: > >> Hi, >> >> I have some cluster services A, B, C, D. >> service B should be started after service A start properly. >> service C should be started after service B. >> >> how can I implement it ? >> >> Best regards. >> >> Mockey >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ghodgens at inpses.co.uk Tue Mar 10 10:00:18 2009 From: ghodgens at inpses.co.uk (Graeme Hodgens) Date: Tue, 10 Mar 2009 10:00:18 -0000 Subject: [Linux-cluster] libmagic.so.1 download References: <6FA9DE3DC907DA408817927FF524F2C8931B20@inpseswin02.inpses.co.uk><49B54575.7050503@sivell.com><6FA9DE3DC907DA408817927FF524F2C8969185@inpseswin02.inpses.co.uk> <49B564F8.5040208@sivell.com> Message-ID: <6FA9DE3DC907DA408817927FF524F2C896924E@inpseswin02.inpses.co.uk> Thank you for your input. I queried the file package and I had the wrong one for my operating system. Your solution was spot on. Thank you Graeme -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of vu pham Sent: 09 March 2009 18:51 To: linux clustering Subject: Re: [Linux-cluster] libmagic.so.1 download Graeme Hodgens wrote: > Thank you for the swift reply but this didn't work unfortunately, still > no install of libmagic, any other suggestions? No problem. What do you mean "this didn't work" ? The package 'file' cannot be installed ? Or is it installed but you still do not have that lib file ? If it is installed, what is the output of "rpm -ql file" ? 
Btw, what is your server's version ? Vu > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of vu pham > Sent: 09 March 2009 16:36 > To: linux clustering > Subject: Re: [Linux-cluster] libmagic.so.1 download > > Graeme Hodgens wrote: >> I need to download a copy of the libmagic.so.1 package to complete my >> cluster configuration. >> >> >> >> Can anyone point me to a suitable download site; I can't find it on > RHN? > > [root at vm1 lib]# rpm -qf libmagic.so.1 > file-4.17-15 > > yum install file > > Vu > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Vu Pham Sivell Corporation 7155 Old Katy Rd. Suite 110 South Houston, TX 77024-2136 voice: 713-821-9800 ext 2203 fax: 713-821-9899 -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From ghodgens at inpses.co.uk Tue Mar 10 11:03:47 2009 From: ghodgens at inpses.co.uk (Graeme Hodgens) Date: Tue, 10 Mar 2009 11:03:47 -0000 Subject: [Linux-cluster] Luci initialisation Message-ID: <6FA9DE3DC907DA408817927FF524F2C8969296@inpseswin02.inpses.co.uk> Hi When I try to initialise luci using luci_admin init, I am get the following, [root at myservername]# luci_admin init Initilaizing the Luci server Creating the 'admin' user Enter password: Confirm password: Please wait... [root at myservername]# At this point I am expecting a lot more information around the password being set and SSL certificates being set. I continue by restarting the service with service luci restart but get the following [root at myservername]# service luci restart Shutting down luci: [ OK ] luci's 'admin' password has to be changed before server is allowed to start To do so, execute (as root): luci_admin password [root at myservername]# luci_admin password The Luci site has not been initialized. To initialize it, execute /usr/sbin/luci_admin init And hey presto I am back to where I started. Any ideas if I am missing dependant packages or any other permissions, etc? Thanks Graeme -------------- next part -------------- An HTML attachment was scrubbed... URL: From esggrupos at gmail.com Tue Mar 10 12:00:52 2009 From: esggrupos at gmail.com (ESGLinux) Date: Tue, 10 Mar 2009 13:00:52 +0100 Subject: [Linux-cluster] fence device LOM Message-ID: <3128ba140903100500t6bf6a7b4kd629c0e8d13c5a0c@mail.gmail.com> Hello all, anyone knows if is it possible to use the integrated service processor with embedded Lights Out Management (LOM) standard that comes with a server SUN FIRE X2200 M2 ** -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From esggrupos at gmail.com Tue Mar 10 12:06:00 2009 From: esggrupos at gmail.com (ESGLinux) Date: Tue, 10 Mar 2009 13:06:00 +0100 Subject: [Linux-cluster] fence device LOM In-Reply-To: <3128ba140903100500t6bf6a7b4kd629c0e8d13c5a0c@mail.gmail.com> References: <3128ba140903100500t6bf6a7b4kd629c0e8d13c5a0c@mail.gmail.com> Message-ID: <3128ba140903100506j5fdede6btd41f421320cebca7@mail.gmail.com> First Sorry for the previous message with the same subject (gmail decided to send it before i finish to write it), I continue with it Hello all, anyone knows if is it possible to use the integrated service processor with embedded Lights Out Management (LOM) standard that comes with a server SUN FIRE X2200 M2 as a fence device in a cluster. If its possible, How it can be configured? I think this is a a serial device, but I?m not sure. And if I?m right where must I plug the serial wire? I?m a bit lost with this device thanks ESG -------------- next part -------------- An HTML attachment was scrubbed... URL: From gordan at bobich.net Tue Mar 10 12:30:06 2009 From: gordan at bobich.net (Gordan Bobic) Date: Tue, 10 Mar 2009 12:30:06 +0000 Subject: [Linux-cluster] fence device LOM In-Reply-To: <3128ba140903100500t6bf6a7b4kd629c0e8d13c5a0c@mail.gmail.com> References: <3128ba140903100500t6bf6a7b4kd629c0e8d13c5a0c@mail.gmail.com> Message-ID: There's no reason why it couldn't be used, but I'm not sure if there is a fencing agent for it already in the distribution. You could, of course, write your own fencing agent for it, it's just a Perl script. Gordan On Tue, 10 Mar 2009 13:00:52 +0100, ESGLinux wrote: > Hello all, > > anyone knows if is it possible to use the integrated service processor with > embedded Lights Out Management (LOM) standard that comes with a server SUN > FIRE X2200 M2 ** From gordan at bobich.net Tue Mar 10 12:34:44 2009 From: gordan at bobich.net (Gordan Bobic) Date: Tue, 10 Mar 2009 12:34:44 +0000 Subject: [Linux-cluster] fence device LOM In-Reply-To: <3128ba140903100506j5fdede6btd41f421320cebca7@mail.gmail.com> References: <3128ba140903100500t6bf6a7b4kd629c0e8d13c5a0c@mail.gmail.com> <3128ba140903100506j5fdede6btd41f421320cebca7@mail.gmail.com> Message-ID: On Tue, 10 Mar 2009 13:06:00 +0100, ESGLinux wrote: [SUN FIRE X2200 M2 as a fenceing device] > If its possible, How it can be configured? I think this is a a serial > device, but I?m not sure. And if I?m right where must I plug the serial > wire? I?m a bit lost with this device Sounds like your first task is to figure out how to use it. Once you've done that, it should be pretty obvious what you have to make the fencing agent do. If it's serial, you'll either need a network->serial switch of some sort, or if you only have two machines, you can cross connect the serial port on one to the LOM port on the other (assuming you have serial ports on those, which may not be the case). Having said that, AFAIK most of the recent machines come with both serial and LAN LOM, so you should be able to use the latter without any additional complications. 
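If the LOM does turn out to speak IPMI over the LAN, the easiest approach is to prove you can drive it by hand before touching the cluster configuration. Something along these lines is a reasonable first test. It is only a sketch: the address, user and password are placeholders for whatever the service processor is actually configured with, and an IPMI 2.0 board will usually want the lanplus interface as shown.

# talk to the service processor directly over the network
ipmitool -I lanplus -H 10.0.0.12 -U admin -P secret chassis power status

# then exercise the stock fencing agent the same way the cluster would
fence_ipmilan -a 10.0.0.12 -l admin -p secret -o status

Once "status" (and ideally "reboot") works from the command line, you can decide whether fence_ipmilan already covers your case or whether a small custom agent is still needed; either way the hard part, getting the LOM to answer at all, is done.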
Gordan From sghosh at redhat.com Tue Mar 10 12:56:41 2009 From: sghosh at redhat.com (Subhendu Ghosh) Date: Tue, 10 Mar 2009 08:56:41 -0400 Subject: [Linux-cluster] fence device LOM In-Reply-To: <3128ba140903100506j5fdede6btd41f421320cebca7@mail.gmail.com> References: <3128ba140903100500t6bf6a7b4kd629c0e8d13c5a0c@mail.gmail.com> <3128ba140903100506j5fdede6btd41f421320cebca7@mail.gmail.com> Message-ID: <49B66389.4040202@redhat.com> ESGLinux wrote: > First > > Sorry for the previous message with the same subject (gmail decided to > send it before i finish to write it), I continue with it > > Hello all, > > anyone knows if is it possible to use the integrated service processor > with embedded Lights Out Management (LOM) standard that comes with a > server SUN FIRE X2200 M2 as a fence device in a cluster. > > If its possible, How it can be configured? I think this is a a serial > device, but I?m not sure. And if I?m right where must I plug the serial > wire? I?m a bit lost with this device > > thanks > > ESG > Check the specs - I thought most of the Sun LOM boards were IPMI capable. If so, fence_ipmi would work once the IPMI interface was configured. -subhendu From esggrupos at gmail.com Tue Mar 10 13:16:19 2009 From: esggrupos at gmail.com (ESGLinux) Date: Tue, 10 Mar 2009 14:16:19 +0100 Subject: [Linux-cluster] fence device LOM In-Reply-To: <49B66389.4040202@redhat.com> References: <3128ba140903100500t6bf6a7b4kd629c0e8d13c5a0c@mail.gmail.com> <3128ba140903100506j5fdede6btd41f421320cebca7@mail.gmail.com> <49B66389.4040202@redhat.com> Message-ID: <3128ba140903100616t1a9dac1et84c536fd6493d59f@mail.gmail.com> Thanks to all, I have checked the specs and I agree with you, In this specs you can read: IPMI 2.0 compliant Service Processor with embedded Lights Out Management offering remote power so I can use it now I have the harder task to make it works ;-) I?ll tell you when I?ll have it, Thanks again Greetings ESG 2009/3/10 Subhendu Ghosh > ESGLinux wrote: > >> First >> >> Sorry for the previous message with the same subject (gmail decided to >> send it before i finish to write it), I continue with it >> >> Hello all, >> >> anyone knows if is it possible to use the integrated service processor >> with embedded Lights Out Management (LOM) standard that comes with a server >> SUN FIRE X2200 M2 as a fence device in a cluster. >> >> If its possible, How it can be configured? I think this is a a serial >> device, but I?m not sure. And if I?m right where must I plug the serial >> wire? I?m a bit lost with this device >> >> thanks >> >> ESG >> >> > Check the specs - I thought most of the Sun LOM boards were IPMI capable. > If so, fence_ipmi would work once the IPMI interface was configured. > > -subhendu > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From emrys at eecs.berkeley.edu Tue Mar 10 17:16:35 2009 From: emrys at eecs.berkeley.edu (Emrys Ingersoll) Date: Tue, 10 Mar 2009 10:16:35 -0700 Subject: [Linux-cluster] fence device LOM In-Reply-To: <3128ba140903100616t1a9dac1et84c536fd6493d59f@mail.gmail.com> References: <3128ba140903100500t6bf6a7b4kd629c0e8d13c5a0c@mail.gmail.com> <3128ba140903100506j5fdede6btd41f421320cebca7@mail.gmail.com> <49B66389.4040202@redhat.com> <3128ba140903100616t1a9dac1et84c536fd6493d59f@mail.gmail.com> Message-ID: <20090310171635.GH9870@eecs.berkeley.edu> I have a two-node CentOS 5.2 cluster running on Sun X4100 M2s which I assume use a similar, if not identical, Service Processor. I am successfully using IPMI for fencing, here's a snippet of my cluster.conf with the relevant info: Good luck, Emrys On Tue, Mar 10, 2009 at 02:16:19PM +0100, ESGLinux wrote: > Thanks to all, > > I have checked the specs and I agree with you, > > In this specs you can read: > > IPMI 2.0 compliant Service Processor with embedded Lights Out Management > offering remote power > > so I can use it > > now I have the harder task to make it works ;-) I?ll tell you when I?ll > have it, > > Thanks again > > Greetings > > ESG > > 2009/3/10 Subhendu Ghosh <[1]sghosh at redhat.com> > > ESGLinux wrote: > > First > > Sorry for the previous message with the same subject (gmail decided to > send it before i finish to write it), I continue with it > > Hello all, > > anyone knows if is it possible to use the integrated service processor > with embedded Lights Out Management (LOM) standard that comes with a > server SUN FIRE X2200 M2 as a fence device in a cluster. > > If its possible, How it can be configured? I think this is a a serial > device, but I?m not sure. And if I?m right where must I plug the > serial wire? I?m a bit lost with this device > > thanks > > ESG > > Check the specs - I thought most of the Sun LOM boards were IPMI > capable. If so, fence_ipmi would work once the IPMI interface was > configured. > -subhendu > > -- > Linux-cluster mailing list > [2]Linux-cluster at redhat.com > [3]https://www.redhat.com/mailman/listinfo/linux-cluster > > References > > Visible links > 1. mailto:sghosh at redhat.com > 2. mailto:Linux-cluster at redhat.com > 3. https://www.redhat.com/mailman/listinfo/linux-cluster > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Emrys Ingersoll Information Management Group Electrical Engineering and Computer Sciences Department University of California, Berkeley (510) 642-5495 (w) http://www.eecs.berkeley.edu/~emrys -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 194 bytes Desc: not available URL: From bmarzins at redhat.com Tue Mar 10 17:26:49 2009 From: bmarzins at redhat.com (Benjamin Marzinski) Date: Tue, 10 Mar 2009 12:26:49 -0500 Subject: [Linux-cluster] can't import gnbd ? In-Reply-To: <025701c9a161$8575f690$9061e3b0$@net> References: <025701c9a161$8575f690$9061e3b0$@net> Message-ID: <20090310172649.GI32340@ether.msp.redhat.com> On Tue, Mar 10, 2009 at 05:20:57PM +0800, Figaro Yang wrote: > Hi , ALL .. > > > > I set up a File Cluster consist of 3 GNBD Server , using gnbd server for > fencing , in case , when I on client node to gnbd_import , system have > return some error message .. how can I solve it ? thank all .. 
> > > > [root at client ~]# modprobe gnbd > > [root at client ~]# gnbd_import -i io3 > > gnbd_import: ERROR cannot get node name : No such file or directory > > gnbd_import: ERROR If you are not planning to use a cluster manager, use > -n Did you start up the cman first? Unless you are planning to use gnbd outside of a cluster (and if you want fencing, you need to use run gnbd in a cluster), you need to start cman first. gnbd_import failed when it called cman_init(). The most likely reason for this is that your cluster is not started yet. -Ben > > [root at client ~]# tail /var/log/messages > > Mar 10 04:44:54 client smartd[3719]: Device: /dev/sda, is SMART capable. > Adding to "monitor" list. > > Mar 10 04:44:54 client smartd[3719]: Monitoring 1 ATA and 0 SCSI devices > > Mar 10 04:44:55 client smartd[3721]: smartd has fork()ed into background > mode. New PID=3721. > > Mar 10 04:44:55 client pcscd: winscard.c:304:SCardConnect() Reader E-Gate > 0 0 Not Found > > Mar 10 04:44:55 client last message repeated 3 times > > Mar 10 04:44:56 client kernel: mtrr: type mismatch for d8000000,2000000 > old: uncachable new: write-combining > > Mar 10 05:15:15 client gnbd_import: ERROR [../../utils/gnbd_utils.c:78] > cman_init failed : No such file or directory > > Mar 10 05:15:17 client last message repeated 2 times > > Mar 10 05:15:21 client kernel: gnbd: registered device at major 252 > > Mar 10 05:15:24 client gnbd_import: ERROR [../../utils/gnbd_utils.c:78] > cman_init failed : No such file or directory > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From bsd_daemon at msn.com Tue Mar 10 17:55:36 2009 From: bsd_daemon at msn.com (Mehmet CELIK) Date: Tue, 10 Mar 2009 17:55:36 +0000 Subject: [Linux-cluster] can't import gnbd ? In-Reply-To: <025701c9a161$8575f690$9061e3b0$@net> References: <025701c9a161$8575f690$9061e3b0$@net> Message-ID: Hi, Did you start cman service ? Or you can use "gnbd_serv -n".. Additionally, each node in Cluster must resolve other node names from hosts file. Note that. -- Regards From: figaro at neo-info.net To: linux-cluster at redhat.com Date: Tue, 10 Mar 2009 17:20:57 +0800 Subject: [Linux-cluster] can't import gnbd ? Hi , ALL .. I set up a File Cluster consist of 3 GNBD Server , using gnbd server for fencing , in case , when I on client node to gnbd_import , system have return some error message .. how can I solve it ? thank all .. [root at client ~]# modprobe gnbd [root at client ~]# gnbd_import -i io3 gnbd_import: ERROR cannot get node name : No such file or directory gnbd_import: ERROR If you are not planning to use a cluster manager, use -n [root at client ~]# tail /var/log/messages Mar 10 04:44:54 client smartd[3719]: Device: /dev/sda, is SMART capable. Adding to "monitor" list. Mar 10 04:44:54 client smartd[3719]: Monitoring 1 ATA and 0 SCSI devices Mar 10 04:44:55 client smartd[3721]: smartd has fork()ed into background mode. New PID=3721. 
Mar 10 04:44:55 client pcscd: winscard.c:304:SCardConnect() Reader E-Gate 0 0 Not Found Mar 10 04:44:55 client last message repeated 3 times Mar 10 04:44:56 client kernel: mtrr: type mismatch for d8000000,2000000 old: uncachable new: write-combining Mar 10 05:15:15 client gnbd_import: ERROR [../../utils/gnbd_utils.c:78] cman_init failed : No such file or directory Mar 10 05:15:17 client last message repeated 2 times Mar 10 05:15:21 client kernel: gnbd: registered device at major 252 Mar 10 05:15:24 client gnbd_import: ERROR [../../utils/gnbd_utils.c:78] cman_init failed : No such file or directory _________________________________________________________________ Kendinizi ifade edin: giri? sayfan?z? Live.com ile istedi?iniz bi?imde tasarlay?n. http://www.live.com/getstarted -------------- next part -------------- An HTML attachment was scrubbed... URL: From figaro at neo-info.net Wed Mar 11 02:45:00 2009 From: figaro at neo-info.net (Figaro Yang) Date: Wed, 11 Mar 2009 10:45:00 +0800 Subject: [Linux-cluster] can't import gnbd ? In-Reply-To: <20090310172649.GI32340@ether.msp.redhat.com> References: <025701c9a161$8575f690$9061e3b0$@net> <20090310172649.GI32340@ether.msp.redhat.com> Message-ID: <014201c9a1f3$5f735a90$1e5a0fb0$@net> Dear Ben : Thank you for your kindly reply. I still have another question as follows: the host client is one of GNBD client in my cluster, does it need to start cman service at client host? The host client is not suppose to be included in /etc/cluster/cluster.conf. My cluster architecture is as below: Hosta ( GFS / GNBD Server ) ---------| Hostb ( GFS / GNBD Server ) ---------| ------------ SAN Storage Hostc ( GFS / GNBD Server ) ---------| | | | --------------- client ( GNBD Client ) -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Benjamin Marzinski Sent: Wednesday, March 11, 2009 1:27 AM To: linux clustering Subject: Re: [Linux-cluster] can't import gnbd ? On Tue, Mar 10, 2009 at 05:20:57PM +0800, Figaro Yang wrote: > Hi , ALL .. > > > > I set up a File Cluster consist of 3 GNBD Server , using gnbd server for > fencing , in case , when I on client node to gnbd_import , system have > return some error message .. how can I solve it ? thank all .. > > > > [root at client ~]# modprobe gnbd > > [root at client ~]# gnbd_import -i io3 > > gnbd_import: ERROR cannot get node name : No such file or directory > > gnbd_import: ERROR If you are not planning to use a cluster manager, use > -n Did you start up the cman first? Unless you are planning to use gnbd outside of a cluster (and if you want fencing, you need to use run gnbd in a cluster), you need to start cman first. gnbd_import failed when it called cman_init(). The most likely reason for this is that your cluster is not started yet. -Ben > > [root at client ~]# tail /var/log/messages > > Mar 10 04:44:54 client smartd[3719]: Device: /dev/sda, is SMART capable. > Adding to "monitor" list. > > Mar 10 04:44:54 client smartd[3719]: Monitoring 1 ATA and 0 SCSI devices > > Mar 10 04:44:55 client smartd[3721]: smartd has fork()ed into background > mode. New PID=3721. 
> > Mar 10 04:44:55 client pcscd: winscard.c:304:SCardConnect() Reader E-Gate > 0 0 Not Found > > Mar 10 04:44:55 client last message repeated 3 times > > Mar 10 04:44:56 client kernel: mtrr: type mismatch for d8000000,2000000 > old: uncachable new: write-combining > > Mar 10 05:15:15 client gnbd_import: ERROR [../../utils/gnbd_utils.c:78] > cman_init failed : No such file or directory > > Mar 10 05:15:17 client last message repeated 2 times > > Mar 10 05:15:21 client kernel: gnbd: registered device at major 252 > > Mar 10 05:15:24 client gnbd_import: ERROR [../../utils/gnbd_utils.c:78] > cman_init failed : No such file or directory > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From andremailinglist at gmail.com Wed Mar 11 12:10:29 2009 From: andremailinglist at gmail.com (Andrew Hole) Date: Wed, 11 Mar 2009 12:10:29 +0000 Subject: [Linux-cluster] distributed cluster Message-ID: <640358900903110510y58f7147ag468182d96dc11c95@mail.gmail.com> Hello! I am trying to find the best approach for creating a distributed cluster based on the following specs: - 2 datacenters (close to each other and connected by fiber channel); - 2 storages (one on each datacenter); - 1 server on each data center In an event of a failure of a single datacenter, the remaining datacenter must be responsible for assuring all services. My approach would be to (I don?t exactly know if it is possible): - Present one LUN from each storage to both servers: server A (DC A) - LUN A (STORAGE A) LUN B (STORAGE B) server B (DC B) - LUN B (STORAGE B) LUN A (STORAGE A) - Create one logical volume on each of the servers using RAID1: server A (DC A) - LVM A /mydata [RAID1 de LUN A e LUN B] server B (DC B) - LVM B /mydata [RAID1 de LUN B e LUN A] - Configure a RedHat Cluster; - Put /mydata on the cluster; Is this a feasible approach? Is there a better option? A. -------------- next part -------------- An HTML attachment was scrubbed... URL: From rmccabe at redhat.com Wed Mar 11 19:31:07 2009 From: rmccabe at redhat.com (Ryan McCabe) Date: Wed, 11 Mar 2009 15:31:07 -0400 Subject: [Linux-cluster] Luci initialisation In-Reply-To: <6FA9DE3DC907DA408817927FF524F2C8969296@inpseswin02.inpses.co.uk> References: <6FA9DE3DC907DA408817927FF524F2C8969296@inpseswin02.inpses.co.uk> Message-ID: <20090311193106.GA26372@redhat.com> On Tue, Mar 10, 2009 at 11:03:47AM -0000, Graeme Hodgens wrote: > [root at myservername]# luci_admin password > > The Luci site has not been initialized. > > To initialize it, execute > > /usr/sbin/luci_admin init > > > > And hey presto I am back to where I started. > > > > Any ideas if I am missing dependant packages or any other permissions, > etc? What OS version and luci version are you using? Ryan From billpp at gmail.com Wed Mar 11 20:02:30 2009 From: billpp at gmail.com (Flavio Junior) Date: Wed, 11 Mar 2009 17:02:30 -0300 Subject: [Linux-cluster] distributed cluster In-Reply-To: <640358900903110510y58f7147ag468182d96dc11c95@mail.gmail.com> References: <640358900903110510y58f7147ag468182d96dc11c95@mail.gmail.com> Message-ID: <58aa8d780903111302j512e4df5p639f4d056e44a5ad@mail.gmail.com> I'm doing something similar... 
I'm using: - 4 IBM x3550 (four nodes) - 2 DS4700 storage - 4 Fiber Channel Switches (2 on each site) What I do: I bought storage ERM (Enhanced Remote Mirror) feature and Brocade Full Fabric feature (to comunicate between sites over switches) I have LUN's active on site1 and backup on site2 and vice-versa. You will need to use/configure a multipath I/O driver (IBM uses RDAC, EMC powerpath... go on) When both storages are configured to mirror LUN1 I see 4 paths to access LUN1: - Site1 -> Controller_A -> LUN1 (Priority: 0) - Site1 -> Controller_B -> LUN1 (Priority: 1) - Site2 -> Controller_A -> LUN1 (Priority: 2) - Site2 -> Controller_B -> LUN1 (Priority: 3) In really, I can only use one path each time, but if one path goes down, I follow to next one by priority. I'm using a mirrored qdisk device too, in the same way. But, if site1 loose comunication with site2 (and vice-versa), local qdisk device can establish the quorum and provide service. In my case, I'll never write data on both sites if loose contact each other. The project is not complete, but I did some tests and dont see any big problem with this setup. You can see a draft of scenario here: http://www.uploadimagens.com/upload/2b58a80ce8c93c205109c9703b6740f5.jpg -- Fl?vio do Carmo J?nior aka waKKu 2009/3/11 Andrew Hole : > Hello! > > I am trying to find the best approach for creating a distributed cluster > based on the following specs: > > -????????? 2 datacenters (close to each other and connected by fiber > channel); > > -????????? 2 storages (one on each datacenter); > > -????????? 1 server on each data center > > > > In an event of a failure of a single datacenter, the remaining datacenter > must be responsible for assuring all services. > > > > My approach would be to (I don?t exactly know if it is possible): > > -????????? Present one LUN from each storage to both servers: > > ?? server A (DC A) - LUN A (STORAGE A) > > ?????????? ???????????????????????????? ????? LUN B (STORAGE B) > > ??? ???????server B (DC B) - LUN B (STORAGE B) > > ??????????????????????????????? ????? LUN A (STORAGE A) > > > > -????????? Create one logical volume on each of the servers using RAID1: > > ??? server A (DC A) - LVM A /mydata [RAID1 de LUN A e LUN B] > > ??? server B (DC B) - LVM B /mydata [RAID1 de LUN B e LUN A] > > > > -????????? Configure a RedHat Cluster; > > -????????? Put /mydata on the cluster; > > > > Is this a feasible approach? Is there a better option? > > ?A. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From gordan at bobich.net Thu Mar 12 00:49:13 2009 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 12 Mar 2009 00:49:13 +0000 Subject: [Linux-cluster] ccsd problems after update to RHEL 5.2/5.3 Message-ID: <49B85C09.8040601@bobich.net> I have a two-node cluster and ever since I updated the kernel and cluster components I cannot get more than one node running with GFS. 
Here are the package versions I have: kernel-2.6.18-92.1.22.el5 cman-2.0.98-1 kmod-gfs-0.1.23-5.el5_2.4 gfs-utils-0.1.17-1.el5 gfs2-utils-0.1.53-1.1 Node 2 starts up OK, but I see this in the syslog: node2 ccsd[5897]: Unable to perform sendto: Cannot assign requested address When I power up node2, it just gets strange and the whole thing locks up: node2 openais[5941]: [CMAN ] cman killed by node 1 because we rejoined the cluster without a full restart node2 groupd[5953]: cman_get_nodes error -1 104 node2 gfs_controld[5995]: groupd_dispatch error -1 errno 11 node2 gfs_controld[5995]: groupd connection died node2 gfs_controld[5995]: cluster is down, exiting So for some reason node 1's joining makes node 2 get kicked out of the cluster - but worse, it doesn't seem to initiate fencing. Instead, the whole cluster just locks up on GFS access. What am I missing? What should I be looking for in the logs? This cluster worked fine before the update. I found this: http://rhn.redhat.com/errata/RHBA-2009-0189.html but updating cman to 2.0.98 as per the RHBA didn't fix the problem. Gordan From fdinitto at redhat.com Thu Mar 12 05:05:55 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Thu, 12 Mar 2009 06:05:55 +0100 Subject: [Linux-cluster] Re: [Cluster-devel] some question about gnbd ? In-Reply-To: <001501c9a2bc$ab6a46b0$023ed410$@net> References: <001501c9a2bc$ab6a46b0$023ed410$@net> Message-ID: <1236834355.24337.44.camel@cerberus.int.fabbione.net> Hi, moving the thread to linux-cluster as it doesn't belong to cluster-devel. On Thu, 2009-03-12 at 10:45 +0800, Figaro Yang wrote: > > > Hi all ~ > > > > I have some question about gnbd ,would you please kindly give me some > advice . > > > > Now, the market already has many Global file system which can support > infiniBand RDMA such as Lustre File , NFS over RDMA ?, but does GNBD > has the some function? GNBD has been deprecated upstream exactly because there are many and more standard alternatives. Fabio From ccaulfie at redhat.com Thu Mar 12 07:45:20 2009 From: ccaulfie at redhat.com (Chrissie Caulfield) Date: Thu, 12 Mar 2009 07:45:20 +0000 Subject: [Linux-cluster] ccsd problems after update to RHEL 5.2/5.3 In-Reply-To: <49B85C09.8040601@bobich.net> References: <49B85C09.8040601@bobich.net> Message-ID: <49B8BD90.6030800@redhat.com> Gordan Bobic wrote: > I have a two-node cluster and ever since I updated the kernel and > cluster components I cannot get more than one node running with GFS. > > Here are the package versions I have: > kernel-2.6.18-92.1.22.el5 > cman-2.0.98-1 > kmod-gfs-0.1.23-5.el5_2.4 > gfs-utils-0.1.17-1.el5 > gfs2-utils-0.1.53-1.1 > > Node 2 starts up OK, but I see this in the syslog: > > node2 ccsd[5897]: Unable to perform sendto: Cannot assign requested address > > When I power up node2, it just gets strange and the whole thing locks up: > node2 openais[5941]: [CMAN ] cman killed by node 1 because we rejoined > the cluster without a full restart > node2 groupd[5953]: cman_get_nodes error -1 104 > node2 gfs_controld[5995]: groupd_dispatch error -1 errno 11 > node2 gfs_controld[5995]: groupd connection died > node2 gfs_controld[5995]: cluster is down, exiting > > So for some reason node 1's joining makes node 2 get kicked out of the > cluster - but worse, it doesn't seem to initiate fencing. Instead, the > whole cluster just locks up on GFS access. > > What am I missing? What should I be looking for in the logs? This > cluster worked fine before the update. 
> > I found this: > http://rhn.redhat.com/errata/RHBA-2009-0189.html > but updating cman to 2.0.98 as per the RHBA didn't fix the problem. > it sounds like you've hit this bug: https://bugzilla.redhat.com/show_bug.cgi?id=487397 -- Chrissie From ghodgens at inpses.co.uk Thu Mar 12 08:03:23 2009 From: ghodgens at inpses.co.uk (Graeme Hodgens) Date: Thu, 12 Mar 2009 08:03:23 -0000 Subject: [Linux-cluster] Luci initialisation References: <6FA9DE3DC907DA408817927FF524F2C8969296@inpseswin02.inpses.co.uk> <20090311193106.GA26372@redhat.com> Message-ID: <6FA9DE3DC907DA408817927FF524F2C896969A@inpseswin02.inpses.co.uk> RHEL 5 32 bit luci-0.8-30.el5 Thanks -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ryan McCabe Sent: 11 March 2009 19:31 To: linux clustering Subject: Re: [Linux-cluster] Luci initialisation On Tue, Mar 10, 2009 at 11:03:47AM -0000, Graeme Hodgens wrote: > [root at myservername]# luci_admin password > > The Luci site has not been initialized. > > To initialize it, execute > > /usr/sbin/luci_admin init > > > > And hey presto I am back to where I started. > > > > Any ideas if I am missing dependant packages or any other permissions, > etc? What OS version and luci version are you using? Ryan -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From gordan at bobich.net Thu Mar 12 08:55:34 2009 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 12 Mar 2009 08:55:34 +0000 Subject: [Linux-cluster] ccsd problems after update to RHEL 5.2/5.3 In-Reply-To: <49B8BD90.6030800@redhat.com> References: <49B85C09.8040601@bobich.net> <49B8BD90.6030800@redhat.com> Message-ID: <49B8CE06.8050706@bobich.net> Chrissie Caulfield wrote: > Gordan Bobic wrote: >> I have a two-node cluster and ever since I updated the kernel and >> cluster components I cannot get more than one node running with GFS. >> >> Here are the package versions I have: >> kernel-2.6.18-92.1.22.el5 >> cman-2.0.98-1 >> kmod-gfs-0.1.23-5.el5_2.4 >> gfs-utils-0.1.17-1.el5 >> gfs2-utils-0.1.53-1.1 >> >> Node 2 starts up OK, but I see this in the syslog: >> >> node2 ccsd[5897]: Unable to perform sendto: Cannot assign requested address >> >> When I power up node2, it just gets strange and the whole thing locks up: >> node2 openais[5941]: [CMAN ] cman killed by node 1 because we rejoined >> the cluster without a full restart >> node2 groupd[5953]: cman_get_nodes error -1 104 >> node2 gfs_controld[5995]: groupd_dispatch error -1 errno 11 >> node2 gfs_controld[5995]: groupd connection died >> node2 gfs_controld[5995]: cluster is down, exiting >> >> So for some reason node 1's joining makes node 2 get kicked out of the >> cluster - but worse, it doesn't seem to initiate fencing. Instead, the >> whole cluster just locks up on GFS access. >> >> What am I missing? What should I be looking for in the logs? This >> cluster worked fine before the update. >> >> I found this: >> http://rhn.redhat.com/errata/RHBA-2009-0189.html >> but updating cman to 2.0.98 as per the RHBA didn't fix the problem. >> > > it sounds like you've hit this bug: > > https://bugzilla.redhat.com/show_bug.cgi?id=487397 What was the last known version of cman that works? 2.0.73? 
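Whichever build turns out to be safe, the rollback itself should be mechanically simple. Roughly, with the package file name below being a placeholder for whatever known-good cman build is settled on:

# what is on the box right now?
rpm -q cman openais

# rpm will happily go backwards if asked to
rpm -Uvh --oldpackage cman-2.0.XX-Y.el5.i386.rpm

# keep yum away from cman until a fixed build appears
# (a stock RHEL 5 /etc/yum.conf only has a [main] section, so appending lands in the right place)
echo "exclude=cman*" >> /etc/yum.conf

A per-repo exclude line in /etc/yum.repos.d/ works just as well if you would rather not touch yum.conf globally.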
Gordan From grimme at atix.de Thu Mar 12 09:35:09 2009 From: grimme at atix.de (Marc Grimme) Date: Thu, 12 Mar 2009 10:35:09 +0100 Subject: [Linux-cluster] ccsd problems after update to RHEL 5.2/5.3 In-Reply-To: <49B8CE06.8050706@bobich.net> References: <49B85C09.8040601@bobich.net> <49B8BD90.6030800@redhat.com> <49B8CE06.8050706@bobich.net> Message-ID: <200903121035.09697.grimme@atix.de> Hi Gordan, more information can be found in this bug (at least you can get the information how far this bug goes back). https://bugzilla.redhat.com/show_bug.cgi?id=485026 -- Gruss / Regards, Marc Grimme http://www.atix.de/ http://www.open-sharedroot.org/ From gordan at bobich.net Thu Mar 12 09:50:37 2009 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 12 Mar 2009 09:50:37 +0000 Subject: [Linux-cluster] ccsd problems after update to RHEL 5.2/5.3 In-Reply-To: <200903121035.09697.grimme@atix.de> References: <49B85C09.8040601@bobich.net> <49B8BD90.6030800@redhat.com> <49B8CE06.8050706@bobich.net> <200903121035.09697.grimme@atix.de> Message-ID: <49B8DAED.70002@bobich.net> Marc Grimme wrote: > Hi Gordan, > more information can be found in this bug (at least you can get the > information how far this bug goes back). > https://bugzilla.redhat.com/show_bug.cgi?id=485026 Yeah, I saw that bug entry. The listing says 5.2 and 5.3 are both affected (cman 2.0.84 and 2.0.98). I tried 2.0.73 from RHEL 5.1, but that makes my OSR startup fail while joining the fencing domain. :-( I haven't tried 2.0.60 from 5.0 yet, but I suspect that I might have to roll back some of the other things packages to get <=2.0.73 working (unhelpfully, RPM dependencies don't seem to point at any problems with versions of other packages). :-/ Gordan From hlawatschek at atix.de Thu Mar 12 10:52:32 2009 From: hlawatschek at atix.de (Mark Hlawatschek) Date: Thu, 12 Mar 2009 11:52:32 +0100 Subject: [Linux-cluster] ccsd problems after update to RHEL 5.2/5.3 In-Reply-To: <49B8DAED.70002@bobich.net> References: <49B85C09.8040601@bobich.net> <200903121035.09697.grimme@atix.de> <49B8DAED.70002@bobich.net> Message-ID: <200903121152.32702.hlawatschek@atix.de> Hi Gordon, > Yeah, I saw that bug entry. The listing says 5.2 and 5.3 are both > affected (cman 2.0.84 and 2.0.98). I tried 2.0.73 from RHEL 5.1, but > that makes my OSR startup fail while joining the fencing domain. :-( This is because the supported parameters in fence_tool have been changed. You'll need to modify the way fence_tool is called in /etc/rhel5/gfs-lib.sh in your open-sharedroot setup. -Mark From gordan at bobich.net Thu Mar 12 11:04:16 2009 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 12 Mar 2009 11:04:16 +0000 Subject: [Linux-cluster] ccsd problems after update to RHEL 5.2/5.3 In-Reply-To: <200903121152.32702.hlawatschek@atix.de> References: <49B85C09.8040601@bobich.net> <200903121035.09697.grimme@atix.de> <49B8DAED.70002@bobich.net> <200903121152.32702.hlawatschek@atix.de> Message-ID: <8ff13324c85112f7ea1b1e15477b723e@localhost> On Thu, 12 Mar 2009 11:52:32 +0100, Mark Hlawatschek wrote: > Hi Gordon, > >> Yeah, I saw that bug entry. The listing says 5.2 and 5.3 are both >> affected (cman 2.0.84 and 2.0.98). I tried 2.0.73 from RHEL 5.1, but >> that makes my OSR startup fail while joining the fencing domain. :-( > > This is because the supported parameters in fence_tool have been changed. > You'll need to modify the way fence_tool is called in /etc/rhel5/gfs-lib.sh > in your open-sharedroot setup. Oh, I see. Thanks for that. 
Is it doing based on a release version check or is it hard-coded (i.e. not working on RHEL 5.0/5.1 any more)? Is there a diff for the change-over? Sorry if this is all obvious from looking at the code, I'm not in front of the suffering cluster right now. :-/ Thanks. Gordan From hlawatschek at atix.de Thu Mar 12 12:27:09 2009 From: hlawatschek at atix.de (Mark Hlawatschek) Date: Thu, 12 Mar 2009 13:27:09 +0100 Subject: [Linux-cluster] ccsd problems after update to RHEL 5.2/5.3 In-Reply-To: <8ff13324c85112f7ea1b1e15477b723e@localhost> References: <49B85C09.8040601@bobich.net> <200903121152.32702.hlawatschek@atix.de> <8ff13324c85112f7ea1b1e15477b723e@localhost> Message-ID: <200903121327.09733.hlawatschek@atix.de> On Thursday 12 March 2009 12:04:16 Gordan Bobic wrote: > On Thu, 12 Mar 2009 11:52:32 +0100, Mark Hlawatschek > > wrote: > > Hi Gordon, > > > >> Yeah, I saw that bug entry. The listing says 5.2 and 5.3 are both > >> affected (cman 2.0.84 and 2.0.98). I tried 2.0.73 from RHEL 5.1, but > >> that makes my OSR startup fail while joining the fencing domain. :-( > > > > This is because the supported parameters in fence_tool have been changed. > > > > You'll need to modify the way fence_tool is called in > > /etc/rhel5/gfs-lib.sh > > > in your open-sharedroot setup. > > Oh, I see. Thanks for that. Is it doing based on a release version check or > is it hard-coded (i.e. not working on RHEL 5.0/5.1 any more)? Is there a > diff for the change-over? Sorry if this is all obvious from looking at the > code, I'm not in front of the suffering cluster right now. :-/ A new parameter (-w) has been added to the fence_tool utility in RHEL5.3. In the latest open-sharedroot preview packages the -w option is used. If your fence_tool version does not support -w, the join command will fail. The rpm changelog shows the following: # rpm -q --changelog cman * Wed Dec 03 2008 Chris Feist - 2.0.84-2_el5_2.3 - Added missing patch to allow delaying fence_tool joins. - Resolves rhbz#474467 -Mark From hlawatschek at atix.de Thu Mar 12 12:42:11 2009 From: hlawatschek at atix.de (Mark Hlawatschek) Date: Thu, 12 Mar 2009 13:42:11 +0100 Subject: [Linux-cluster] ccsd problems after update to RHEL 5.2/5.3 In-Reply-To: <200903121327.09733.hlawatschek@atix.de> References: <49B85C09.8040601@bobich.net> <8ff13324c85112f7ea1b1e15477b723e@localhost> <200903121327.09733.hlawatschek@atix.de> Message-ID: <200903121342.11962.hlawatschek@atix.de> The new parameter is -m. Sorry for the confusion. -Mark > A new parameter (-w) has been added to the fence_tool utility in RHEL5.3. > In the latest open-sharedroot preview packages the -w option is used. If > your fence_tool version does not support -w, the join command will fail. > The rpm changelog shows the following: > # rpm -q --changelog cman > * Wed Dec 03 2008 Chris Feist - 2.0.84-2_el5_2.3 > - Added missing patch to allow delaying fence_tool joins. > - Resolves rhbz#474467 From gianluca.cecchi at gmail.com Thu Mar 12 16:11:32 2009 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Thu, 12 Mar 2009 17:11:32 +0100 Subject: [Linux-cluster] ccsd problems after update to RHEL Message-ID: <561c252c0903120911t38986fdfmbc4801a1471cfe7e@mail.gmail.com> On Thu, 12 Mar 2009 11:04:16 +0000 Gordan Bobic wrote: >>Marc Grimme wrote: >> Hi Gordan, >> more information can be found in this bug (at least you can get the >> information how far this bug goes back). >> https://bugzilla.redhat.com/show_bug.cgi?id=485026 > Yeah, I saw that bug entry. 
The listing says 5.2 and 5.3 are both > affected (cman 2.0.84 and 2.0.98). I tried 2.0.73 from RHEL 5.1, but you can update to 5U3 and then in comment #86 of the bug referred by Marc you have links to download working cman binaries for 5U3. HIH Gianluca From gordan at bobich.net Thu Mar 12 16:20:34 2009 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 12 Mar 2009 16:20:34 +0000 Subject: [Linux-cluster] ccsd problems after update to RHEL In-Reply-To: <561c252c0903120911t38986fdfmbc4801a1471cfe7e@mail.gmail.com> References: <561c252c0903120911t38986fdfmbc4801a1471cfe7e@mail.gmail.com> Message-ID: I saw that, but they still seem to be for cman-2.0.98, which I've tried from the latest 5.3 package, and it's still broken. The bug page Chrissie pointed at still doesn't list any resolution. On Thu, 12 Mar 2009 17:11:32 +0100, Gianluca Cecchi wrote: > On Thu, 12 Mar 2009 11:04:16 +0000 Gordan Bobic wrote: > >>>Marc Grimme wrote: >>> Hi Gordan, >>> more information can be found in this bug (at least you can get the >>> information how far this bug goes back). >>> https://bugzilla.redhat.com/show_bug.cgi?id=485026 > >> Yeah, I saw that bug entry. The listing says 5.2 and 5.3 are both >> affected (cman 2.0.84 and 2.0.98). I tried 2.0.73 from RHEL 5.1, but > you can update to 5U3 and then in comment #86 of the bug referred by > Marc you have links to download working cman binaries for 5U3. > > HIH > Gianluca > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From gordan at bobich.net Thu Mar 12 16:30:32 2009 From: gordan at bobich.net (Gordan Bobic) Date: Thu, 12 Mar 2009 16:30:32 +0000 Subject: [Linux-cluster] ccsd problems after update to RHEL In-Reply-To: References: <561c252c0903120911t38986fdfmbc4801a1471cfe7e@mail.gmail.com> Message-ID: <56d3c47f3a69c3a71585144bd5b7c951@localhost> Sorry, re-reading that I can see what you mean now. I'll try the patched packages listed in a bit. Gordan On Thu, 12 Mar 2009 16:20:34 +0000, Gordan Bobic wrote: > I saw that, but they still seem to be for cman-2.0.98, which I've tried > from the latest 5.3 package, and it's still broken. The bug page Chrissie > pointed at still doesn't list any resolution. > > On Thu, 12 Mar 2009 17:11:32 +0100, Gianluca Cecchi > wrote: >> On Thu, 12 Mar 2009 11:04:16 +0000 Gordan Bobic wrote: >> >>>>Marc Grimme wrote: >>>> Hi Gordan, >>>> more information can be found in this bug (at least you can get the >>>> information how far this bug goes back). >>>> https://bugzilla.redhat.com/show_bug.cgi?id=485026 >> >>> Yeah, I saw that bug entry. The listing says 5.2 and 5.3 are both >>> affected (cman 2.0.84 and 2.0.98). I tried 2.0.73 from RHEL 5.1, but >> you can update to 5U3 and then in comment #86 of the bug referred by >> Marc you have links to download working cman binaries for 5U3. >> >> HIH >> Gianluca >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From Alain.Moulle at bull.net Fri Mar 13 10:20:36 2009 From: Alain.Moulle at bull.net (Alain.Moulle) Date: Fri, 13 Mar 2009 11:20:36 +0100 Subject: [Linux-cluster] CS5 : limit for number of nodes in cluster Message-ID: <49BA3374.5050601@bull.net> Hi , it seems that the CS5 supports up to 128 nodes ... (whereas it was 8 with CS3 and CS4 ? ) did some of you have tested at least the CS5 with more than 10 nodes ? 
does it reveal any big problem or restriction to have big clusters with CS5 ? Thanks Regards Alain -------------- next part -------------- An HTML attachment was scrubbed... URL: From ccaulfie at redhat.com Fri Mar 13 10:29:55 2009 From: ccaulfie at redhat.com (Chrissie Caulfield) Date: Fri, 13 Mar 2009 10:29:55 +0000 Subject: [Linux-cluster] CS5 : limit for number of nodes in cluster In-Reply-To: <49BA3374.5050601@bull.net> References: <49BA3374.5050601@bull.net> Message-ID: <49BA35A3.7060201@redhat.com> Alain.Moulle wrote: > Hi , > > it seems that the CS5 supports up to 128 nodes ... > (whereas it was 8 with CS3 and CS4 ? ) > > did some of you have tested at least the CS5 with more than 10 nodes ? > > does it reveal any big problem or restriction to have big clusters with > CS5 ? I have tested Red Hat cluster with up to 60 nodes. Some tuning is required to get it that far see http://sources.redhat.com/cluster/wiki/FAQ/CMAN#large_clusters -- Chrissie From jeff.sturm at eprize.com Fri Mar 13 13:08:26 2009 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Fri, 13 Mar 2009 09:08:26 -0400 Subject: [Linux-cluster] CS5 : limit for number of nodes in cluster In-Reply-To: <49BA3374.5050601@bull.net> References: <49BA3374.5050601@bull.net> Message-ID: <64D0546C5EBBD147B75DE133D798665F021BA4C9@hugo.eprize.local> 16 nodes here (CentOS 5.2). No problems relating to cluster size. Invest in your network--you won't regret it. We've repeatedly found RHCS is only as good as the network connecting it. We currently have Juniper EX-series switches connected to multiple interfaces on each node. The physical interfaces are bonded in active-passive mode. We're considering active-active (LACP). What have others done? -Jeff ________________________________ From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Alain.Moulle Sent: Friday, March 13, 2009 6:21 AM To: linux-cluster at redhat.com Subject: [Linux-cluster] CS5 : limit for number of nodes in cluster Hi , it seems that the CS5 supports up to 128 nodes ... (whereas it was 8 with CS3 and CS4 ? ) did some of you have tested at least the CS5 with more than 10 nodes ? does it reveal any big problem or restriction to have big clusters with CS5 ? Thanks Regards Alain From Gary_Hunt at gallup.com Fri Mar 13 15:37:14 2009 From: Gary_Hunt at gallup.com (Hunt, Gary) Date: Fri, 13 Mar 2009 10:37:14 -0500 Subject: [Linux-cluster] Adding node using luci admin interface Message-ID: Having issues with adding a node to an existing cluster. Hope I am just missing something simple. RHEL 5.3 2 node cluster with quorum disk. Whenever I try to add a third node with the luci admin interface the new node comes up with a cluster.conf that looks like this The 2 existing nodes get an updated cluster.conf, but the new node doesn't. Am I missing something? All I am doing is giving luci the hostname and root password of the new node in the Add a node section. Thanks Gary ________________________________ IMPORTANT NOTICE: This e-mail message and all attachments, if any, may contain confidential and privileged material and are intended only for the person or entity to which the message is addressed. If you are not an intended recipient, you are hereby notified that any use, dissemination, distribution, disclosure, or copying of this information is unauthorized and strictly prohibited. If you have received this communication in error, please contact the sender immediately by reply e-mail, and destroy all copies of the original message. 
-------------- next part -------------- An HTML attachment was scrubbed... URL: From kadlec at mail.kfki.hu Fri Mar 13 16:54:11 2009 From: kadlec at mail.kfki.hu (Kadlecsik Jozsef) Date: Fri, 13 Mar 2009 17:54:11 +0100 (CET) Subject: [Linux-cluster] www.openais.org disappeared: from where to donwload openais? Message-ID: Hello, www.openais.org cannot be accessed as the (hosting) account is not active at siteground.com. The last whitetank branch which can be downloaded from freshmeat/osdl.org is openais-0.80.4.tar.gz. However that release has got a backward compatibility bug, see https://bugzilla.redhat.com/show_bug.cgi?id=487214. So, is there a way to donwload openais-0.80.5.tar.gz (or openais-0.80.5-4.el5_4) from somewhere? Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 49, Hungary From sdake at redhat.com Fri Mar 13 18:01:42 2009 From: sdake at redhat.com (Steven Dake) Date: Fri, 13 Mar 2009 11:01:42 -0700 Subject: [Linux-cluster] www.openais.org disappeared: from where to donwload openais? In-Reply-To: References: Message-ID: <1236967303.28410.219.camel@sdake-laptop> On Fri, 2009-03-13 at 17:54 +0100, Kadlecsik Jozsef wrote: > Hello, > > www.openais.org cannot be accessed as the (hosting) account is not active > at siteground.com. > working on this - problems with hosting site. > The last whitetank branch which can be downloaded from freshmeat/osdl.org > is openais-0.80.4.tar.gz. However that release has got a backward > compatibility bug, see https://bugzilla.redhat.com/show_bug.cgi?id=487214. > > So, is there a way to donwload openais-0.80.5.tar.gz (or > openais-0.80.5-4.el5_4) from somewhere? > The rpm openais-0.80.5-4 is not available. openais-0.80.5 has a backward compatibility issue as well because of the backport of the confdb system. You can always obtain the latest sources from svn via: svn checkout http://svn.fedorahosted.org/svn/openais/branches/whitetank commits on whitetank are very locked down so the source should be fairly safe for your use. You might be able to obtain a backwards compat version by reverting confdb: home/whitetank: svn diff -r 1709:1710 | patch -p0 --reverse home/whitetank: svn diff -r 1713:1714 | patch -p0 --reverse Regards -steve > Best regards, > Jozsef > -- > E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu > PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt > Address: KFKI Research Institute for Particle and Nuclear Physics > H-1525 Budapest 114, POB. 49, Hungary > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From kadlec at mail.kfki.hu Fri Mar 13 18:59:35 2009 From: kadlec at mail.kfki.hu (Kadlecsik Jozsef) Date: Fri, 13 Mar 2009 19:59:35 +0100 (CET) Subject: [Linux-cluster] www.openais.org disappeared: from where to donwload openais? In-Reply-To: <1236967303.28410.219.camel@sdake-laptop> References: <1236967303.28410.219.camel@sdake-laptop> Message-ID: Hi, On Fri, 13 Mar 2009, Steven Dake wrote: > > The last whitetank branch which can be downloaded from freshmeat/osdl.org > > is openais-0.80.4.tar.gz. However that release has got a backward > > compatibility bug, see https://bugzilla.redhat.com/show_bug.cgi?id=487214. > > > > So, is there a way to donwload openais-0.80.5.tar.gz (or > > openais-0.80.5-4.el5_4) from somewhere? > > > The rpm openais-0.80.5-4 is not available. 
> > openais-0.80.5 has a backward compatibility issue as well because of the > backport of the confdb system. You can always obtain the latest sources > from svn via: > > svn checkout http://svn.fedorahosted.org/svn/openais/branches/whitetank > > commits on whitetank are very locked down so the source should be fairly > safe for your use. You might be able to obtain a backwards compat > version by reverting confdb: > > home/whitetank: svn diff -r 1709:1710 | patch -p0 --reverse > home/whitetank: svn diff -r 1713:1714 | patch -p0 --reverse Thank you the info! We are in the mid of an upgrade from openais-0.80.3 and cluster-2.01.00 to openais-0.80.4 and cluster-2.03.11 in a five node cluster. A single test node is upgraded to openais-0.80.4/cluster-2.03.11 and it almost works: sometimes the upgraded node just starts up fine, but mostly cman fails due to aisexec. However if cman is restarted, then it works OK. If it's just a backward compatibility issue and we'd know in advance that all the other nodes can safely be upgraded (with starting cman two times), then that's OK. Or do you suggest to go with whitetank from svn, the two patches above reverted? Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 49, Hungary From sdake at redhat.com Fri Mar 13 19:48:22 2009 From: sdake at redhat.com (Steven Dake) Date: Fri, 13 Mar 2009 12:48:22 -0700 Subject: [Linux-cluster] www.openais.org disappeared: from where to donwload openais? In-Reply-To: References: <1236967303.28410.219.camel@sdake-laptop> Message-ID: <1236973702.28410.232.camel@sdake-laptop> On Fri, 2009-03-13 at 19:59 +0100, Kadlecsik Jozsef wrote: > Hi, > > On Fri, 13 Mar 2009, Steven Dake wrote: > > > > The last whitetank branch which can be downloaded from freshmeat/osdl.org > > > is openais-0.80.4.tar.gz. However that release has got a backward > > > compatibility bug, see https://bugzilla.redhat.com/show_bug.cgi?id=487214. > > > > > > So, is there a way to donwload openais-0.80.5.tar.gz (or > > > openais-0.80.5-4.el5_4) from somewhere? > > > > > The rpm openais-0.80.5-4 is not available. > > > > openais-0.80.5 has a backward compatibility issue as well because of the > > backport of the confdb system. You can always obtain the latest sources > > from svn via: > > > > svn checkout http://svn.fedorahosted.org/svn/openais/branches/whitetank > > > > commits on whitetank are very locked down so the source should be fairly > > safe for your use. You might be able to obtain a backwards compat > > version by reverting confdb: > > > > home/whitetank: svn diff -r 1709:1710 | patch -p0 --reverse > > home/whitetank: svn diff -r 1713:1714 | patch -p0 --reverse > > Thank you the info! > > We are in the mid of an upgrade from openais-0.80.3 and cluster-2.01.00 to > openais-0.80.4 and cluster-2.03.11 in a five node cluster. A single test > node is upgraded to openais-0.80.4/cluster-2.03.11 and it almost works: > sometimes the upgraded node just starts up fine, but mostly cman fails > due to aisexec. However if cman is restarted, then it works OK. > > If it's just a backward compatibility issue and we'd know in advance that > all the other nodes can safely be upgraded (with starting cman two times), > then that's OK. Or do you suggest to go with whitetank from svn, the two > patches above reverted? > openais-0.80.3 from tarball has around 80 defects. 
I'd recommend using the latest svn with the confdb patches reverted. openais-0.80.6 should be released within the next 30 days with confdb backward compat and a possible defect in the cpg system under certain testing done in the community. I'd like to stress we really lock down source commits in whitetank (0.80.x) to ensure high code quality since it has thousands of field deployments and even little defects can have dreadful effects. Regards -steve > Best regards, > Jozsef > -- > E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu > PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt > Address: KFKI Research Institute for Particle and Nuclear Physics > H-1525 Budapest 114, POB. 49, Hungary From vu at sivell.com Fri Mar 13 20:57:23 2009 From: vu at sivell.com (vu pham) Date: Fri, 13 Mar 2009 14:57:23 -0600 Subject: [Linux-cluster] Adding node using luci admin interface In-Reply-To: References: Message-ID: <49BAC8B3.1070302@sivell.com> Hunt, Gary wrote: > Having issues with adding a node to an existing cluster. Hope I am just > missing something simple. > > > > RHEL 5.3 2 node cluster with quorum disk. Whenever I try to add a third > node with the luci admin interface the new node comes up with a > cluster.conf that looks like this > > > > > > > > > > > > > > > > > > > > > > The 2 existing nodes get an updated cluster.conf, but the new node > doesn?t. Am I missing something? All I am doing is giving luci the > hostname and root password of the new node in the Add a node section. > > > luci that comes with 5.3/RHN has problem. Try luci with 5.2 Vu From vu at sivell.com Fri Mar 13 21:54:28 2009 From: vu at sivell.com (vu pham) Date: Fri, 13 Mar 2009 15:54:28 -0600 Subject: [Linux-cluster] Adding node using luci admin interface In-Reply-To: <49BAC8B3.1070302@sivell.com> References: <49BAC8B3.1070302@sivell.com> Message-ID: <49BAD614.8040309@sivell.com> vu pham wrote: > Hunt, Gary wrote: >> Having issues with adding a node to an existing cluster. Hope I am >> just missing something simple. >> >> >> >> RHEL 5.3 2 node cluster with quorum disk. Whenever I try to add a >> third node with the luci admin interface the new node comes up with a >> cluster.conf that looks like this >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> The 2 existing nodes get an updated cluster.conf, but the new node >> doesn?t. Am I missing something? All I am doing is giving luci the >> hostname and root password of the new node in the Add a node section. >> >> >> > > luci that comes with 5.3/RHN has problem. Try luci with 5.2 > I may be wrong. I just tried luci on my 5.2 system to add a node and it failed. Anyway, it successful updates cluster.conf of existing nodes so all I did after to add new node to the cluster is : - scp /etc/cluster/* from existing node to new node - install the following packages : gfs2-utils gfs-utils rgmanager kmod-gfs2-xen kmod-gfs-xen lvm2-cluster. You may not need the *-xen-* packages if you are not on a domU. - start services cman, clvmd, gfs2 and rgmanager. After that my new node joins the cluster just fine . Then luci sees my new node, too. I love luci, I hate lucy. I still not remember all the options of cluster services so I still have to use it. For options I remember, I just edit cluster.conf with vi and ccs_tool update. 
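Spelled out as shell commands, the manual procedure above might look roughly like the following. This is only a sketch: "node1" is a placeholder for an existing cluster member, the kmod-*-xen packages are only needed inside a Xen domU, and cman/openais are assumed to be installed on the new node already.

  # on the new node, copy the cluster configuration from an existing member
  scp node1:/etc/cluster/* /etc/cluster/
  # install the packages listed above (add kmod-gfs-xen kmod-gfs2-xen on a domU)
  yum install gfs-utils gfs2-utils rgmanager lvm2-cluster
  # start the cluster services in the order given above
  for svc in cman clvmd gfs2 rgmanager; do service $svc start; done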
Vu From vu at sivell.com Sat Mar 14 02:07:26 2009 From: vu at sivell.com (Vu Pham) Date: Fri, 13 Mar 2009 21:07:26 -0500 Subject: [Linux-cluster] man pages for Message-ID: <49BB115E.1020809@sivell.com> Which man pages describe tags under in cluster.conf ? For example, I am looking for man pages for tags such as , and all the relating ones below those tags such as , Thanks, Vu From henry.robertson at hjrconsulting.com Sun Mar 15 03:52:16 2009 From: henry.robertson at hjrconsulting.com (Henry Robertson) Date: Sat, 14 Mar 2009 23:52:16 -0400 Subject: [Linux-cluster] 5.3 -- Multiple Apache + clurgmgrd Message-ID: I'm having an issue starting a clustered Apache service in 5.3. If I do a basic ip->httpd service, I get an error about a missing PID. snippet from cluster.conf Error: Mar 15 07:43:25 ag01 clurgmgrd: [18329]: Checking Existence Of File /var/run/cluster/apache/apache:httpd1.pid [apache:httpd1] > Failed - File Doesn't Exist Mar 15 07:43:25 ag01 clurgmgrd: [18329]: Stopping Service apache:httpd1 > Failed If I go into /etc/cluster/apache/apache:httpd/httpd.conf and uncomment the ###Listen 192.168.51.88:80 it works fine. It also genereates the proper PID. Ideas on what's going on here? I'm pretty sure it's supposed to inherit the Listen address from the cluster.conf file, not have it commented out. Regards, Henry Robertson HJR Consulting LLC -------------- next part -------------- An HTML attachment was scrubbed... URL: From chattygk at gmail.com Mon Mar 16 05:42:08 2009 From: chattygk at gmail.com (Chaitanya Kulkarni) Date: Mon, 16 Mar 2009 11:12:08 +0530 Subject: [Linux-cluster] Can two clusters have same name? Message-ID: <1ad236320903152242v5230fc75h88b615e371cd814e@mail.gmail.com> Hi All, What happens if in the same network, we try to create two clusters with the same name? Does it cause any problem? Thanks, Chaitanya -------------- next part -------------- An HTML attachment was scrubbed... URL: From ccaulfie at redhat.com Mon Mar 16 07:58:49 2009 From: ccaulfie at redhat.com (Chrissie Caulfield) Date: Mon, 16 Mar 2009 07:58:49 +0000 Subject: [Linux-cluster] Can two clusters have same name? In-Reply-To: <1ad236320903152242v5230fc75h88b615e371cd814e@mail.gmail.com> References: <1ad236320903152242v5230fc75h88b615e371cd814e@mail.gmail.com> Message-ID: <49BE06B9.6070201@redhat.com> Chaitanya Kulkarni wrote: > Hi All, > > What happens if in the same network, we try to create two clusters with > the same name? > > Does it cause any problem? YES LOTS! At best the two clusters will merge into one, at worst you will get node evictions because of clashes between node IDs Actually you *can* do this if you change the cluster_id/multicast address or port number in cluster.conf. But need to be careful and it is not recommended. The main reason I say not to do this is that GFS volumes have the cluster name embedded in the super block. If you have two clusters with the same cluster name on the same SAN then it's going to be very easy to totally corrupt the GFS filesystem by mounting it on two different clusters. Chrissie From chattygk at gmail.com Mon Mar 16 08:16:06 2009 From: chattygk at gmail.com (Chaitanya Kulkarni) Date: Mon, 16 Mar 2009 13:46:06 +0530 Subject: [Linux-cluster] Can two clusters have same name? In-Reply-To: <49BE06B9.6070201@redhat.com> References: <1ad236320903152242v5230fc75h88b615e371cd814e@mail.gmail.com> <49BE06B9.6070201@redhat.com> Message-ID: <1ad236320903160116m62deee7eqdae7b135db48ccf6@mail.gmail.com> Thanks for your reply Chrissie. But is this, i.e. 
deployment of clusters with same name, a valid scenario? How often (as in say 1 in a 100) may I see such deployments, if at all? Thanks, Chaitanya On Mon, Mar 16, 2009 at 1:28 PM, Chrissie Caulfield wrote: > Chaitanya Kulkarni wrote: > > Hi All, > > > > What happens if in the same network, we try to create two clusters with > > the same name? > > > > Does it cause any problem? > > YES LOTS! > > At best the two clusters will merge into one, at worst you will get node > evictions because of clashes between node IDs > > Actually you *can* do this if you change the cluster_id/multicast > address or port number in cluster.conf. But need to be careful and it is > not recommended. > > The main reason I say not to do this is that GFS volumes have the > cluster name embedded in the super block. If you have two clusters with > the same cluster name on the same SAN then it's going to be very easy to > totally corrupt the GFS filesystem by mounting it on two different > clusters. > > > Chrissie > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ccaulfie at redhat.com Mon Mar 16 08:37:05 2009 From: ccaulfie at redhat.com (Chrissie Caulfield) Date: Mon, 16 Mar 2009 08:37:05 +0000 Subject: [Linux-cluster] Can two clusters have same name? In-Reply-To: <1ad236320903160116m62deee7eqdae7b135db48ccf6@mail.gmail.com> References: <1ad236320903152242v5230fc75h88b615e371cd814e@mail.gmail.com> <49BE06B9.6070201@redhat.com> <1ad236320903160116m62deee7eqdae7b135db48ccf6@mail.gmail.com> Message-ID: <49BE0FB1.5070905@redhat.com> Chaitanya Kulkarni wrote: > Thanks for your reply Chrissie. > > But is this, i.e. deployment of clusters with same name, a valid > scenario? How often (as in say 1 in a 100) may I see such deployments, > if at all? > Sorry, I really have no idea how many times you might see such deployments. How would I work it out?! Cluster names are chosen by the administrators ... those people are not easily predictable ;-) Chrissie > > On Mon, Mar 16, 2009 at 1:28 PM, Chrissie Caulfield > wrote: > > Chaitanya Kulkarni wrote: > > Hi All, > > > > What happens if in the same network, we try to create two clusters > with > > the same name? > > > > Does it cause any problem? > > YES LOTS! > > At best the two clusters will merge into one, at worst you will get node > evictions because of clashes between node IDs > > Actually you *can* do this if you change the cluster_id/multicast > address or port number in cluster.conf. But need to be careful and it is > not recommended. > > The main reason I say not to do this is that GFS volumes have the > cluster name embedded in the super block. If you have two clusters with > the same cluster name on the same SAN then it's going to be very easy to > totally corrupt the GFS filesystem by mounting it on two different > clusters. > > > Chrissie > From chattygk at gmail.com Mon Mar 16 08:47:47 2009 From: chattygk at gmail.com (Chaitanya Kulkarni) Date: Mon, 16 Mar 2009 14:17:47 +0530 Subject: [Linux-cluster] Can two clusters have same name? In-Reply-To: <49BE0FB1.5070905@redhat.com> References: <1ad236320903152242v5230fc75h88b615e371cd814e@mail.gmail.com> <49BE06B9.6070201@redhat.com> <1ad236320903160116m62deee7eqdae7b135db48ccf6@mail.gmail.com> <49BE0FB1.5070905@redhat.com> Message-ID: <1ad236320903160147k62e07a8av62cad2cce00aacf7@mail.gmail.com> True. Anyways. Thanks for your help. 
Regards, Chaitanya On Mon, Mar 16, 2009 at 2:07 PM, Chrissie Caulfield wrote: > Chaitanya Kulkarni wrote: > > Thanks for your reply Chrissie. > > > > But is this, i.e. deployment of clusters with same name, a valid > > scenario? How often (as in say 1 in a 100) may I see such deployments, > > if at all? > > > > Sorry, I really have no idea how many times you might see such > deployments. How would I work it out?! Cluster names are chosen by the > administrators ... those people are not easily predictable ;-) > > Chrissie > > > > > > On Mon, Mar 16, 2009 at 1:28 PM, Chrissie Caulfield > > wrote: > > > > Chaitanya Kulkarni wrote: > > > Hi All, > > > > > > What happens if in the same network, we try to create two clusters > > with > > > the same name? > > > > > > Does it cause any problem? > > > > YES LOTS! > > > > At best the two clusters will merge into one, at worst you will get > node > > evictions because of clashes between node IDs > > > > Actually you *can* do this if you change the cluster_id/multicast > > address or port number in cluster.conf. But need to be careful and it > is > > not recommended. > > > > The main reason I say not to do this is that GFS volumes have the > > cluster name embedded in the super block. If you have two clusters > with > > the same cluster name on the same SAN then it's going to be very easy > to > > totally corrupt the GFS filesystem by mounting it on two different > > clusters. > > > > > > Chrissie > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rkprotocol at gmail.com Mon Mar 16 08:50:09 2009 From: rkprotocol at gmail.com (ramesh kasula) Date: Mon, 16 Mar 2009 14:20:09 +0530 Subject: [Linux-cluster] Can two clusters have same name? In-Reply-To: <1ad236320903160116m62deee7eqdae7b135db48ccf6@mail.gmail.com> References: <1ad236320903152242v5230fc75h88b615e371cd814e@mail.gmail.com> <49BE06B9.6070201@redhat.com> <1ad236320903160116m62deee7eqdae7b135db48ccf6@mail.gmail.com> Message-ID: <98fa30820903160150x3a601806hdd77589a3db53eb4@mail.gmail.com> u stupids , dont send any mails to me, remove my mail from ur group 2009/3/16 Chaitanya Kulkarni > Thanks for your reply Chrissie. > > But is this, i.e. deployment of clusters with same name, a valid scenario? > How often (as in say 1 in a 100) may I see such deployments, if at all? > > Thanks, > Chaitanya > > On Mon, Mar 16, 2009 at 1:28 PM, Chrissie Caulfield wrote: > >> Chaitanya Kulkarni wrote: >> > Hi All, >> > >> > What happens if in the same network, we try to create two clusters with >> > the same name? >> > >> > Does it cause any problem? >> >> YES LOTS! >> >> At best the two clusters will merge into one, at worst you will get node >> evictions because of clashes between node IDs >> >> Actually you *can* do this if you change the cluster_id/multicast >> address or port number in cluster.conf. But need to be careful and it is >> not recommended. >> >> The main reason I say not to do this is that GFS volumes have the >> cluster name embedded in the super block. If you have two clusters with >> the same cluster name on the same SAN then it's going to be very easy to >> totally corrupt the GFS filesystem by mounting it on two different >> clusters. 
>> >> >> Chrissie >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kadlec at mail.kfki.hu Mon Mar 16 09:33:24 2009 From: kadlec at mail.kfki.hu (Kadlecsik Jozsef) Date: Mon, 16 Mar 2009 10:33:24 +0100 (CET) Subject: [Linux-cluster] www.openais.org disappeared: from where to donwload openais? In-Reply-To: <1236967303.28410.219.camel@sdake-laptop> References: <1236967303.28410.219.camel@sdake-laptop> Message-ID: Hi, On Fri, 13 Mar 2009, Steven Dake wrote: > openais-0.80.5 has a backward compatibility issue as well because of the > backport of the confdb system. You can always obtain the latest sources > from svn via: > > svn checkout http://svn.fedorahosted.org/svn/openais/branches/whitetank > > commits on whitetank are very locked down so the source should be fairly > safe for your use. You might be able to obtain a backwards compat > version by reverting confdb: > > home/whitetank: svn diff -r 1709:1710 | patch -p0 --reverse > home/whitetank: svn diff -r 1713:1714 | patch -p0 --reverse openais from svn and the two commits above reverted does not start up: Starting /usr/sbin/aisexec aisexec -f CMAN_NODENAME=saturn-gfs CMAN_DEBUGLOG=255 OPENAIS_DEFAULT_CONFIG_IFACE=cmanconfig CMAN_PIPE=4 CC: setting logger levels forked process ID is 8264 [MAIN ] AIS Executive Service RELEASE 'subrev 1152 version 0.80' [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors. [MAIN ] Copyright (C) 2006 Red Hat, Inc. [MAIN ] AIS Executive Service: started and ready to provide service. [MAIN ] Using override node name saturn-gfs [TOTEM] Token Timeout (10000 ms) retransmit timeout (495 ms) [TOTEM] token hold (386 ms) retransmits before loss (20 retrans) [TOTEM] join (60 ms) send_join (0 ms) consensus (4800 ms) merge (200 ms) [TOTEM] downcheck (1000 ms) fail to recv const (50 msgs) [TOTEM] seqno unchanged const (30 rotations) Maximum network MTU 1500 [TOTEM] window size per rotation (50 messages) maximum messages per rotation (17 messages) [TOTEM] send threads (0 threads) [TOTEM] RRP token expired timeout (495 ms) [TOTEM] RRP token problem counter (2000 ms) [TOTEM] RRP threshold (10 problem count) [TOTEM] RRP mode set to none. [TOTEM] heartbeat_failures_allowed (0) [TOTEM] max_network_delay (50 ms) [TOTEM] HeartBeat is Disabled. To enable set heartbeat_failures_allowed > 0 [TOTEM] Receive multicast socket recv buffer size (288000 bytes). [TOTEM] Transmit multicast socket send buffer size (262142 bytes). [TOTEM] The network interface [192.168.192.18] is now up. [TOTEM] Created or loaded sequence id 3452.192.168.192.18 for this ring. [TOTEM] entering GATHER state from 15. aisexec died: Error, reason code is 4 According to strace, it's a segmentation fault. Could I do anything to make it work? Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 
49, Hungary From wferi at niif.hu Mon Mar 16 10:06:30 2009 From: wferi at niif.hu (Ferenc Wagner) Date: Mon, 16 Mar 2009 11:06:30 +0100 Subject: [Linux-cluster] distributed cluster In-Reply-To: <640358900903110510y58f7147ag468182d96dc11c95@mail.gmail.com> (Andrew Hole's message of "Wed, 11 Mar 2009 12:10:29 +0000") References: <640358900903110510y58f7147ag468182d96dc11c95@mail.gmail.com> Message-ID: <87r60xg4t5.fsf@tac.ki.iif.hu> Andrew Hole writes: > server A (DC A) - LVM A /mydata [RAID1 de LUN A e LUN B] > server B (DC B) - LVM B /mydata [RAID1 de LUN B e LUN A] Observe that you mustn't assemble the same MD RAID1 on the two servers at the same time. MD RAID isn't cluster aware. -- Feri. From mgrac at redhat.com Mon Mar 16 12:03:24 2009 From: mgrac at redhat.com (Marek Grac) Date: Mon, 16 Mar 2009 13:03:24 +0100 Subject: [Linux-cluster] 5.3 -- Multiple Apache + clurgmgrd In-Reply-To: References: Message-ID: <49BE400C.7030703@redhat.com> Hi, Henry Robertson wrote: > I'm having an issue starting a clustered Apache service in 5.3. > > If I do a basic ip->httpd service, I get an error about a missing PID. > > snippet from cluster.conf > > recovery="relocate"> > > name="httpd1" server_root="/etc/httpd/httpd1" shutdown_wait="0"/> > > > > > Error: > Mar 15 07:43:25 ag01 clurgmgrd: [18329]: Checking Existence Of > File /var/run/cluster/apache/apache:httpd1.pid [apache:httpd1] > > Failed - File Doesn't Exist > Mar 15 07:43:25 ag01 clurgmgrd: [18329]: Stopping Service > apache:httpd1 > Failed > > If I go into /etc/cluster/apache/apache:httpd/httpd.conf and uncomment > the ###Listen 192.168.51.88:80 it works > fine. It also genereates the proper PID. > > Ideas on what's going on here? I'm pretty sure it's supposed to > inherit the Listen address from the cluster.conf file, not have it > commented out. Apache is looking for IP in cluster.conf but as a sibling, not a parent tag. marx, From theophanis_kontogiannis at yahoo.gr Mon Mar 16 15:23:35 2009 From: theophanis_kontogiannis at yahoo.gr (Theophanis Kontogiannis) Date: Mon, 16 Mar 2009 17:23:35 +0200 Subject: [Linux-cluster] 'ls' makes GFS2 to withdraw Message-ID: <008d01c9a64b$2ddbb020$89931060$@gr> Hello all, I have Centos 5.2, kernel 2.6.18-92.1.22.el5.centos.plus, gfs2-utils-0.1.44-1.el5_2.1 The cluster is two nodes, using DRBD 8.3.2 as the shared block device, and CLVM over it, and GFS2 over it. After an ls in a directory within the GFS2 file system I got the following errors. ........ 
GFS2: fsid=tweety:gfs2-00.0: fatal: invalid metadata block GFS2: fsid=tweety:gfs2-00.0: bh = 522538 (magic number) GFS2: fsid=tweety:gfs2-00.0: function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 332 GFS2: fsid=tweety:gfs2-00.0: about to withdraw this file system GFS2: fsid=tweety:gfs2-00.0: telling LM to withdraw GFS2: fsid=tweety:gfs2-00.0: withdrawn Call Trace: [] :gfs2:gfs2_lm_withdraw+0xc1/0xd0 [] __wait_on_bit+0x60/0x6e [] sync_buffer+0x0/0x3f [] out_of_line_wait_on_bit+0x6c/0x78 [] wake_bit_function+0x0/0x23 [] :gfs2:gfs2_meta_check_ii+0x2c/0x38 [] :gfs2:gfs2_meta_indirect_buffer+0x104/0x15e [] :gfs2:gfs2_inode_refresh+0x22/0x2ca [] wake_bit_function+0x0/0x23 [] :gfs2:inode_go_lock+0x29/0x57 [] :gfs2:glock_wait_internal+0x1d4/0x23f [] :gfs2:gfs2_glock_nq+0x1ae/0x1d4 [] :gfs2:gfs2_lookup+0x58/0xa7 [] :gfs2:gfs2_lookup+0x50/0xa7 [] d_alloc+0x174/0x1a9 [] do_lookup+0xe5/0x1e6 [] __link_path_walk+0xa01/0xf42 [] zone_statistics+0x3e/0x6d [] link_path_walk+0x5c/0xe5 [] :gfs2:gfs2_glock_put+0x26/0x133 [] do_path_lookup+0x270/0x2e8 [] getname+0x15b/0x1c1 [] __user_walk_fd+0x37/0x4c [] vfs_lstat_fd+0x18/0x47 [] sys_newlstat+0x19/0x31 [] tracesys+0x71/0xe0 [] tracesys+0xd5/0xe0 .......... Obviously ls was not the cause of the problem but it triggered the events. >From the other node I can have access on the directory that on which the 'ls' triggered the above. The directory is full of files like that: ?--------- ? ? ? ? ? sched_reply Almost 50% of the files are in shown like that with ls. The questions are: 1. Is this a (new) GFS2 bug? 2. Is this a recoverable problem (and how)? 3. After a GFS2 file system gets withdrawn, how do we make the node to use it again, without rebooting? Thank you all for your time. Theophanis Kontogiannis -------------- next part -------------- An HTML attachment was scrubbed... URL: From theophanis_kontogiannis at yahoo.gr Mon Mar 16 15:28:17 2009 From: theophanis_kontogiannis at yahoo.gr (Theophanis Kontogiannis) Date: Mon, 16 Mar 2009 17:28:17 +0200 Subject: [Linux-cluster] 'ls' makes GFS2 to withdraw In-Reply-To: <008d01c9a64b$2ddbb020$89931060$@gr> References: <008d01c9a64b$2ddbb020$89931060$@gr> Message-ID: <009801c9a64b$d551b6b0$7ff52410$@gr> Hello again, A few minutes after my initial post the gfs2 file system failed also on the first node. .......... 
GFS2: fsid=tweety:gfs2-00.1: fatal: invalid metadata block GFS2: fsid=tweety:gfs2-00.1: bh = 522538 (magic number) GFS2: fsid=tweety:gfs2-00.1: function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 332 GFS2: fsid=tweety:gfs2-00.1: about to withdraw this file system GFS2: fsid=tweety:gfs2-00.1: telling LM to withdraw GFS2: fsid=tweety:gfs2-00.1: withdrawn Call Trace: [] :gfs2:gfs2_lm_withdraw+0xc1/0xd0 [] __wait_on_bit+0x60/0x6e [] sync_buffer+0x0/0x3f [] out_of_line_wait_on_bit+0x6c/0x78 [] wake_bit_function+0x0/0x23 [] :gfs2:gfs2_meta_check_ii+0x2c/0x38 [] :gfs2:gfs2_meta_indirect_buffer+0x104/0x15e [] :gfs2:gfs2_inode_refresh+0x22/0x2ca [] :gfs2:inode_go_lock+0x29/0x57 [] :gfs2:glock_wait_internal+0x1d4/0x23f [] :gfs2:gfs2_glock_nq+0x1ae/0x1d4 [] :gfs2:gfs2_lookup+0x58/0xa7 [] :gfs2:gfs2_lookup+0x50/0xa7 [] d_alloc+0x174/0x1a9 [] do_lookup+0xe5/0x1e6 [] __link_path_walk+0xa01/0xf42 [] link_path_walk+0x5c/0xe5 [] mntput_no_expire+0x19/0x89 [] sys_getxattr+0x51/0x62 [] do_path_lookup+0x270/0x2e8 [] getname+0x15b/0x1c1 [] __user_walk_fd+0x37/0x4c [] vfs_lstat_fd+0x18/0x47 [] mntput_no_expire+0x19/0x89 [] sys_getxattr+0x51/0x62 [] sys_newlstat+0x19/0x31 [] tracesys+0x71/0xe0 [] tracesys+0xd5/0xe0 ............ Thank you all for your time Theophanis Kontogiannis From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Theophanis Kontogiannis Sent: Monday, March 16, 2009 5:24 PM To: 'linux clustering' Subject: [Linux-cluster] 'ls' makes GFS2 to withdraw Hello all, I have Centos 5.2, kernel 2.6.18-92.1.22.el5.centos.plus, gfs2-utils-0.1.44-1.el5_2.1 The cluster is two nodes, using DRBD 8.3.2 as the shared block device, and CLVM over it, and GFS2 over it. After an ls in a directory within the GFS2 file system I got the following errors. ........ GFS2: fsid=tweety:gfs2-00.0: fatal: invalid metadata block GFS2: fsid=tweety:gfs2-00.0: bh = 522538 (magic number) GFS2: fsid=tweety:gfs2-00.0: function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 332 GFS2: fsid=tweety:gfs2-00.0: about to withdraw this file system GFS2: fsid=tweety:gfs2-00.0: telling LM to withdraw GFS2: fsid=tweety:gfs2-00.0: withdrawn Call Trace: [] :gfs2:gfs2_lm_withdraw+0xc1/0xd0 [] __wait_on_bit+0x60/0x6e [] sync_buffer+0x0/0x3f [] out_of_line_wait_on_bit+0x6c/0x78 [] wake_bit_function+0x0/0x23 [] :gfs2:gfs2_meta_check_ii+0x2c/0x38 [] :gfs2:gfs2_meta_indirect_buffer+0x104/0x15e [] :gfs2:gfs2_inode_refresh+0x22/0x2ca [] wake_bit_function+0x0/0x23 [] :gfs2:inode_go_lock+0x29/0x57 [] :gfs2:glock_wait_internal+0x1d4/0x23f [] :gfs2:gfs2_glock_nq+0x1ae/0x1d4 [] :gfs2:gfs2_lookup+0x58/0xa7 [] :gfs2:gfs2_lookup+0x50/0xa7 [] d_alloc+0x174/0x1a9 [] do_lookup+0xe5/0x1e6 [] __link_path_walk+0xa01/0xf42 [] zone_statistics+0x3e/0x6d [] link_path_walk+0x5c/0xe5 [] :gfs2:gfs2_glock_put+0x26/0x133 [] do_path_lookup+0x270/0x2e8 [] getname+0x15b/0x1c1 [] __user_walk_fd+0x37/0x4c [] vfs_lstat_fd+0x18/0x47 [] sys_newlstat+0x19/0x31 [] tracesys+0x71/0xe0 [] tracesys+0xd5/0xe0 .......... Obviously ls was not the cause of the problem but it triggered the events. >From the other node I can have access on the directory that on which the 'ls' triggered the above. The directory is full of files like that: ?--------- ? ? ? ? ? sched_reply Almost 50% of the files are in shown like that with ls. The questions are: 1. Is this a (new) GFS2 bug? 2. Is this a recoverable problem (and how)? 3. After a GFS2 file system gets withdrawn, how do we make the node to use it again, without rebooting? 
Thank you all for your time. Theophanis Kontogiannis -------------- next part -------------- An HTML attachment was scrubbed... URL: From swhiteho at redhat.com Mon Mar 16 15:24:15 2009 From: swhiteho at redhat.com (Steven Whitehouse) Date: Mon, 16 Mar 2009 15:24:15 +0000 Subject: [Linux-cluster] 'ls' makes GFS2 to withdraw In-Reply-To: <008d01c9a64b$2ddbb020$89931060$@gr> References: <008d01c9a64b$2ddbb020$89931060$@gr> Message-ID: <1237217055.9571.927.camel@quoit> Hi, Please do not use GFS2 on Centos 5.2, it is rather old. Did you try running fsck.gfs2 ? The results you see look like the readdir() call has worked, but that the stat() call to the directory entry has failed. I'd suggest using Fedora at least until Centos 5.3 is available, Steve. On Mon, 2009-03-16 at 17:23 +0200, Theophanis Kontogiannis wrote: > Hello all, > > > > I have Centos 5.2, kernel 2.6.18-92.1.22.el5.centos.plus, > gfs2-utils-0.1.44-1.el5_2.1 > > > > The cluster is two nodes, using DRBD 8.3.2 as the shared block device, > and CLVM over it, and GFS2 over it. > > > > After an ls in a directory within the GFS2 file system I got the > following errors. > > > > ??????? > > GFS2: fsid=tweety:gfs2-00.0: fatal: invalid metadata block > > GFS2: fsid=tweety:gfs2-00.0: bh = 522538 (magic number) > > GFS2: fsid=tweety:gfs2-00.0: function = gfs2_meta_indirect_buffer, > file = fs/gfs2/meta_io.c, line = 332 > > GFS2: fsid=tweety:gfs2-00.0: about to withdraw this file system > > GFS2: fsid=tweety:gfs2-00.0: telling LM to withdraw > > GFS2: fsid=tweety:gfs2-00.0: withdrawn > > > > Call Trace: > > [] :gfs2:gfs2_lm_withdraw+0xc1/0xd0 > > [] __wait_on_bit+0x60/0x6e > > [] sync_buffer+0x0/0x3f > > [] out_of_line_wait_on_bit+0x6c/0x78 > > [] wake_bit_function+0x0/0x23 > > [] :gfs2:gfs2_meta_check_ii+0x2c/0x38 > > [] :gfs2:gfs2_meta_indirect_buffer+0x104/0x15e > > [] :gfs2:gfs2_inode_refresh+0x22/0x2ca > > [] wake_bit_function+0x0/0x23 > > [] :gfs2:inode_go_lock+0x29/0x57 > > [] :gfs2:glock_wait_internal+0x1d4/0x23f > > [] :gfs2:gfs2_glock_nq+0x1ae/0x1d4 > > [] :gfs2:gfs2_lookup+0x58/0xa7 > > [] :gfs2:gfs2_lookup+0x50/0xa7 > > [] d_alloc+0x174/0x1a9 > > [] do_lookup+0xe5/0x1e6 > > [] __link_path_walk+0xa01/0xf42 > > [] zone_statistics+0x3e/0x6d > > [] link_path_walk+0x5c/0xe5 > > [] :gfs2:gfs2_glock_put+0x26/0x133 > > [] do_path_lookup+0x270/0x2e8 > > [] getname+0x15b/0x1c1 > > [] __user_walk_fd+0x37/0x4c > > [] vfs_lstat_fd+0x18/0x47 > > [] sys_newlstat+0x19/0x31 > > [] tracesys+0x71/0xe0 > > [] tracesys+0xd5/0xe0 > > ????????. > > > > > > Obviously ls was not the cause of the problem but it triggered the > events. > > > > From the other node I can have access on the directory that on which > the ?ls? triggered the above. The directory is full of files like > that: > > > > ?--------- ? ? ? ? ? sched_reply > > > > Almost 50% of the files are in shown like that with ls. > > > > The questions are: > > > > 1. Is this a (new) GFS2 bug? > > 2. Is this a recoverable problem (and how)? > > 3. After a GFS2 file system gets withdrawn, how do we make the > node to use it again, without rebooting? > > > > Thank you all for your time. > > > > Theophanis Kontogiannis > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From klmr5883 at gmail.com Mon Mar 16 16:43:35 2009 From: klmr5883 at gmail.com (kali mero) Date: Mon, 16 Mar 2009 16:43:35 +0000 Subject: [Linux-cluster] Post-Join Delay? 
Message-ID: <40aa95950903160943r2857aedfybd329a92c68ae4cd@mail.gmail.com> Hi all, While I was reading the documentation I came across an awkward statement: "The Post-Join Delay parameter is the number of seconds the fence daemon (fenced) waits before fencing a node after the node joins the fence domain. The Post-Join Delay default value is 3. A typical setting for Post-Join Delay is between 20 and 30 seconds, but can vary according to cluster and network performance." I usually leave this in the default value, but the docs refer to something in between 20 and 30. So, what are your usually settings for this parameters? What is your recommendation? TIA -------------- next part -------------- An HTML attachment was scrubbed... URL: From henry.robertson at hjrconsulting.com Mon Mar 16 17:09:06 2009 From: henry.robertson at hjrconsulting.com (Henry Robertson) Date: Mon, 16 Mar 2009 13:09:06 -0400 Subject: [Linux-cluster] Re: 5.3 -- Multiple Apache + clurgmgrd Message-ID: > > ------------------------------ > > ------------------------------ > > Message: 9 > Date: Mon, 16 Mar 2009 13:03:24 +0100 > From: Marek Grac > Subject: Re: [Linux-cluster] > To: linux clustering > Message-ID: <49BE400C.7030703 at redhat.com> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > Hi, > > Henry Robertson wrote: > > I'm having an issue starting a clustered Apache service in 5.3. > > > > If I do a basic ip->httpd service, I get an error about a missing PID. > > > > snippet from cluster.conf > > > > > recovery="relocate"> > > > > > name="httpd1" server_root="/etc/httpd/httpd1" shutdown_wait="0"/> > > > > > > > > > > Error: > > Mar 15 07:43:25 ag01 clurgmgrd: [18329]: Checking Existence Of > > File /var/run/cluster/apache/apache:httpd1.pid [apache:httpd1] > > > Failed - File Doesn't Exist > > Mar 15 07:43:25 ag01 clurgmgrd: [18329]: Stopping Service > > apache:httpd1 > Failed > > > > If I go into /etc/cluster/apache/apache:httpd/httpd.conf and uncomment > > the ###Listen 192.168.51.88:80 it works > > fine. It also genereates the proper PID. > > > > Ideas on what's going on here? I'm pretty sure it's supposed to > > inherit the Listen address from the cluster.conf file, not have it > > commented out. > > Apache is looking for IP in cluster.conf but as a sibling, not a parent > tag. > > > marx > > > > ------------------------------ > > Turns out it was from bug https://bugzilla.redhat.com/show_bug.cgi?id=489785 After I patched apache.sh and removed all the old httpd.conf's everything worked fine. Regards, Henry Robertson HJR Consulting LLC -------------- next part -------------- An HTML attachment was scrubbed... URL: From Ed.Sanborn at genband.com Mon Mar 16 19:43:26 2009 From: Ed.Sanborn at genband.com (Ed Sanborn) Date: Mon, 16 Mar 2009 15:43:26 -0400 Subject: [Linux-cluster] Samba as clustered service Message-ID: <593E210EDC38444DA1C17E9E9F5E264B98FB1F@GBMDMail01.genband.com> Hi folks, I have a RHEL 5.2 cluster using GFS, etc. I have the GFS filesystem being served up via a clustered NFS resource. But I have Samba running on one of the cluster nodes and NOT yet setup as a clustered service. Can anyone who has Samba working clustered share their experience and/or pointers to install highlights? I found this link but I am questioning it's accuracy as it is RHEL-AS-2.1 related install documentation. Thanks, Ed Ed Sanborn Senior IT Manager GENBAND Inc. 
3 Federal Street Billerica, MA 01821 office +1.978 495-3055 mobile +1.978 210-9855 ed.sanborn at genband.com -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 3407 bytes Desc: image001.gif URL: From Ed.Sanborn at genband.com Mon Mar 16 19:47:25 2009 From: Ed.Sanborn at genband.com (Ed Sanborn) Date: Mon, 16 Mar 2009 15:47:25 -0400 Subject: [Linux-cluster] Samba as clustered service In-Reply-To: <593E210EDC38444DA1C17E9E9F5E264B98FB1F@GBMDMail01.genband.com> References: <593E210EDC38444DA1C17E9E9F5E264B98FB1F@GBMDMail01.genband.com> Message-ID: <593E210EDC38444DA1C17E9E9F5E264B98FB20@GBMDMail01.genband.com> This was the link I referred to: http://www.redhat.com/docs/manuals/enterprise/RHEL-AS-2.1-Manual/cluster -manager/s1-service-samba.html From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Ed Sanborn Sent: Monday, March 16, 2009 3:43 PM To: linux-cluster at redhat.com Subject: [Linux-cluster] Samba as clustered service Hi folks, I have a RHEL 5.2 cluster using GFS, etc. I have the GFS filesystem being served up via a clustered NFS resource. But I have Samba running on one of the cluster nodes and NOT yet setup as a clustered service. Can anyone who has Samba working clustered share their experience and/or pointers to install highlights? I found this link but I am questioning it's accuracy as it is RHEL-AS-2.1 related install documentation. Thanks, Ed Ed Sanborn Senior IT Manager GENBAND Inc. 3 Federal Street Billerica, MA 01821 office +1.978 495-3055 mobile +1.978 210-9855 ed.sanborn at genband.com -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 3407 bytes Desc: image001.gif URL: From EliasM at dnb.com Tue Mar 17 05:21:56 2009 From: EliasM at dnb.com (Elias, Michael) Date: Tue, 17 Mar 2009 01:21:56 -0400 Subject: [Linux-cluster] Lock contention on BerkleyDB lock file Message-ID: <437A36C1327D794D87D207AC80BDD8FD0B051747@DNBMSXBH002.dnbint.net> Hi all, We are trying to investigate an incident where rogue multiple instances of Jboss caused possible lock contention on a BerkleyDB lock file. This is a 5 node cluster, 3 nodes using a BerkelyDB on a GFS partition. Basic sequence of events; Node1 appears to be the lock master for the BerkleyDB lock file Node2 starts 4 addition JBoss servers (under investigation to how/why they started) that are in an uninterruptible sleep state. Attempted killing the process with kill -9, but not killable as is usual with the dreaded "D" state. Node 1 we stop the JBoss running on that server which seem to release any contention on Node 2, we are able to kill all the JBoss process. We then restart the JBoss server on Node 1. Node 2 we removed the node from the cluster to release any dlm locks it is holding. Stopping the following in this order rgmanger, gfs, clvmd, fenced, cman. Node 2, we restarted the cluster, cman, fenced, clvmd, gfs.. Once starting GFS we start receiving dlm_lock errors and would not mount. At that point dlm seemed to make gfs unavailable on nodes 1,2,and 3. At that point we recycled nodes 1,2,and3 and they came up clean. So questions; 1. In the log info below "ex plock -11"; I believe ex=exclusive plock=posixlock but what does "-11" mean? 2. 
Why would stopping JBoss on node1 release the contention on node2? 3. Why would node 2 complain about dlm_lock when restarting all the cluster services and lockup the cluster 4. And not necessarily for this group, is it possible for JBoss to start random servers due to some sort of lock contention. Thanks for any help that is offered. Michael The log info below is from node2 and is before any corrective action was taken. It is the state when the additional JBoss processes were running. /proc/cluster/lock_dlm/debug /proc/cluster/lock_dlm/debug:14043 global conflict -11 0-0 ex 1 own 1, pid 1u /proc/cluster/lock_dlm/debug:14043 ex plock -11 /proc/cluster/lock_dlm/debug:14037 en plock 7,25101ee /proc/cluster/lock_dlm/debug:14037 global conflict -11 0-0 ex 1 own 1, pid 1u /proc/cluster/lock_dlm/debug:14037 ex plock -11 /proc/cluster/lock_dlm/debug:14036 en plock 7,25101ee /proc/cluster/lock_dlm/debug:14036 global conflict -11 0-0 ex 1 own 1, pid 1u /proc/cluster/lock_dlm/debug:14036 ex plock -11 /proc/cluster/lock_dlm/debug:14039 en plock 7,25101ee /proc/cluster/lock_dlm/debug:14039 global conflict -11 0-0 ex 1 own 1, pid 1u /proc/cluster/lock_dlm/debug:14039 ex plock -11 /proc/cluster/lock_dlm/debug:22414 en punlock 7,f1021d /proc/cluster/lock_dlm/debug:14038 en plock 7,25101ee /proc/cluster/lock_dlm/debug:14038 global conflict -11 0-0 ex 1 own 1, pid 1u /proc/cluster/lock_dlm/debug:14038 ex plock -11 /proc/cluster/lock_dlm/debug:14041 en plock 7,25101ee /proc/cluster/lock_dlm/debug:14041 global conflict -11 0-0 ex 1 own 1, pid 1u /proc/cluster/lock_dlm/debug:14041 ex plock -11 /proc/cluster/lock_dlm/debug:14041 en plock 7,25101ee /proc/cluster/lock_dlm/debug:14041 global conflict -11 0-0 ex 1 own 1, pid 1u /proc/cluster/lock_dlm/debug:14041 ex plock -11 /proc/cluster/lock_dlm/debug:14040 en plock 7,25101ee /proc/cluster/lock_dlm/debug:14040 global conflict -11 0-0 ex 1 own 1, pid 1u /proc/cluster/lock_dlm/debug:14040 ex plock -11 /proc/cluster/lock_dlm/debug:14040 en plock 7,25101ee /proc/cluster/lock_dlm/debug:14040 global conflict -11 0-0 ex 1 own 1, pid 1u /proc/cluster/lock_dlm/debug:14040 ex plock -11 /proc/cluster/lock_dlm/debug:14037 en plock 7,25101ee /proc/cluster/lock_dlm/debug:14037 global conflict -11 0-0 ex 1 own 1, pid 1u /proc/cluster/lock_dlm/debug:14037 ex plock -11 /proc/cluster/lock_dlm/debug:14040 en plock 7,25101ee /proc/cluster/lock_dlm/debug:14040 global conflict -11 0-0 ex 1 own 1, pid 1u /proc/cluster/lock_dlm/debug:14040 ex plock -11 /proc/cluster/lock_dlm/debug:14041 en plock 7,25101ee /proc/cluster/lock_dlm/debug:14041 global conflict -11 0-0 ex 1 own 1, pid 1u /proc/cluster/lock_dlm/debug:14041 ex plock -11 /proc/cluster/lock_dlm/debug:14040 en plock 7,25101ee /proc/cluster/lock_dlm/debug:14040 global conflict -11 0-0 ex 1 own 1, pid 1u /proc/cluster/lock_dlm/debug:14040 ex plock -11 /proc/cluster/lock_dlm/debug:14038 en plock 7,25101ee /proc/cluster/lock_dlm/debug:14038 global conflict -11 0-0 ex 1 own 1, pid 1u /proc/cluster/lock_dlm/debug:14038 ex plock -11 /proc/cluster/lock_dlm/debug:14044 en plock 7,25101ee /proc/cluster/lock_dlm/debug:14044 global conflict -11 0-0 ex 1 own 1, pid 1u /proc/cluster/lock_dlm/debug:14044 ex plock -11 /proc/cluster/lock_dlm/debug:14040 en plock 7,25101ee /proc/cluster/lock_dlm/debug:14040 global conflict -11 0-0 ex 1 own 1, pid 1u /proc/cluster/lock_dlm/debug:14040 ex plock -11 That inode hade the following in DLM_locks Resource 00000100b833c508 (parent 0000000000000000). 
Name (len=24) " 2 25101ee" Local Copy, Master is node 1 Granted Queue 154f0156 PR 14040 Master: 8fe80123 Conversion Queue Waiting Queue Resource 00000101da9a9598 (parent 0000000000000000). Name (len=24) " 5 25101ee" Master Copy Granted Queue 1229013e PR 14040 098f0154 PR 19040 Remote: 5 d1fa036e 190c02f1 PR 6587 Remote: 1 12cb023f 8612028c PR 15187 Remote: 3 00160179 Conversion Queue Waiting Queue Resource 00000101c5d28908 (parent 0000000000000000). Name (len=24) " 11 25101ee" Master Copy Granted Queue 28630181 NL 24849 Remote: 1 8ca6012e 277703b2 NL 32360 86f603bc NL 15187 Remote: 3 0011006d Conversion Queue Waiting Queue Resource 00000100eab6c9d8 (parent 0000000000000000). Name (len=24) " 7 25101ee" Master Copy LVB: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Granted Queue a7dd00b2 EX 21994 0-0 Remote: 1 8858013a 95a603ef NL 16953 86fd00a7 NL 15187 Remote: 3 000e005e 748702cf NL 8385 Remote: 1 a25e0322 Conversion Queue Waiting Queue The inode "25101ee" points to the following file: 38863342 -rw-r--r-- 1 nobody nobody 0 Jan 18 13:19 je.lck nobody 13892 34.1 64.1 8238260 5138044 ? Sl Mar01 6328:11 /usr/java/jdk1.5.0_16/bin/java -Dprogram.name=run_yyy1.sh -server -Xms6144m -Xmx6144m -XX:MaxPermSize=256m -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 -Djava.endorsed.dirs=/xxxusr1/jboss/jboss-4.0.5.GA/lib/endorsed -classpath /xxxusr1/jboss/jboss-4.0.5.GA/bin/run.jar:/usr/java/jdk1.5.0_16/lib/tool s.jar org.jboss.Main -c yyy1 -Dbinding-ports=ports-default -Djboss.partition.name=PROD1-yyy1 nobody 21803 0.0 53.8 8223868 4311272 ? D 17:59 0:00 /usr/java/jdk1.5.0_16/bin/java -Dprogram.name=run_yyy1.sh -server -Xms6144m -Xmx6144m -XX:MaxPermSize=256m -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 -Djava.endorsed.dirs=/xxxusr1/jboss/jboss-4.0.5.GA/lib/endorsed -classpath /xxxusr1/jboss/jboss-4.0.5.GA/bin/run.jar:/usr/java/jdk1.5.0_16/lib/tool s.jar org.jboss.Main -c yyy1 -Dbinding-ports=ports-default -Djboss.partition.name=PROD1-yyy1 nobody 21808 0.0 53.8 8223868 4311496 ? D 17:59 0:00 /usr/java/jdk1.5.0_16/bin/java -Dprogram.name=run_yyy1.sh -server -Xms6144m -Xmx6144m -XX:MaxPermSize=256m -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 -Djava.endorsed.dirs=/xxxusr1/jboss/jboss-4.0.5.GA/lib/endorsed -classpath /xxxusr1/jboss/jboss-4.0.5.GA/bin/run.jar:/usr/java/jdk1.5.0_16/lib/tool s.jar org.jboss.Main -c yyy1 -Dbinding-ports=ports-default -Djboss.partition.name=PROD1-yyy1 nobody 21809 0.0 53.8 8223868 4311524 ? D 17:59 0:00 /usr/java/jdk1.5.0_16/bin/java -Dprogram.name=run_yyy1.sh -server -Xms6144m -Xmx6144m -XX:MaxPermSize=256m -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 -Djava.endorsed.dirs=/xxxusr1/jboss/jboss-4.0.5.GA/lib/endorsed -classpath /xxxusr1/jboss/jboss-4.0.5.GA/bin/run.jar:/usr/java/jdk1.5.0_16/lib/tool s.jar org.jboss.Main -c yyy1 -Dbinding-ports=ports-default -Djboss.partition.name=PROD1-yyy1 nobody 22414 0.0 55.2 8232092 4419488 ? 
D 18:03 0:00 /usr/java/jdk1.5.0_16/bin/java -Dprogram.name=run_yyy1.sh -server -Xms6144m -Xmx6144m -XX:MaxPermSize=256m -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 -Djava.endorsed.dirs=/xxxusr1/jboss/jboss-4.0.5.GA/lib/endorsed -classpath /xxxusr1/jboss/jboss-4.0.5.GA/bin/run.jar:/usr/java/jdk1.5.0_16/lib/tool s.jar org.jboss.Main -c yyy1 -Dbinding-ports=ports-default -Djboss.partition.name=PROD1-yyy1 -------------- next part -------------- An HTML attachment was scrubbed... URL: From kadlec at mail.kfki.hu Tue Mar 17 09:29:33 2009 From: kadlec at mail.kfki.hu (Kadlecsik Jozsef) Date: Tue, 17 Mar 2009 10:29:33 +0100 (CET) Subject: [Linux-cluster] www.openais.org disappeared: from where to donwload openais? In-Reply-To: References: <1236967303.28410.219.camel@sdake-laptop> Message-ID: On Mon, 16 Mar 2009, Kadlecsik Jozsef wrote: > > commits on whitetank are very locked down so the source should be fairly > > safe for your use. You might be able to obtain a backwards compat > > version by reverting confdb: > > > > home/whitetank: svn diff -r 1709:1710 | patch -p0 --reverse > > home/whitetank: svn diff -r 1713:1714 | patch -p0 --reverse > > openais from svn and the two commits above reverted does not start up: > > Starting /usr/sbin/aisexec aisexec -f > CMAN_NODENAME=saturn-gfs > CMAN_DEBUGLOG=255 > OPENAIS_DEFAULT_CONFIG_IFACE=cmanconfig > CMAN_PIPE=4 > CC: setting logger levels > forked process ID is 8264 > [MAIN ] AIS Executive Service RELEASE 'subrev 1152 version 0.80' > [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors. > [MAIN ] Copyright (C) 2006 Red Hat, Inc. > [MAIN ] AIS Executive Service: started and ready to provide service. > [MAIN ] Using override node name saturn-gfs > [TOTEM] Token Timeout (10000 ms) retransmit timeout (495 ms) > [TOTEM] token hold (386 ms) retransmits before loss (20 retrans) > [TOTEM] join (60 ms) send_join (0 ms) consensus (4800 ms) merge (200 ms) > [TOTEM] downcheck (1000 ms) fail to recv const (50 msgs) > [TOTEM] seqno unchanged const (30 rotations) Maximum network MTU 1500 > [TOTEM] window size per rotation (50 messages) maximum messages per rotation (17 messages) > [TOTEM] send threads (0 threads) > [TOTEM] RRP token expired timeout (495 ms) > [TOTEM] RRP token problem counter (2000 ms) > [TOTEM] RRP threshold (10 problem count) > [TOTEM] RRP mode set to none. > [TOTEM] heartbeat_failures_allowed (0) > [TOTEM] max_network_delay (50 ms) > [TOTEM] HeartBeat is Disabled. To enable set heartbeat_failures_allowed > 0 > [TOTEM] Receive multicast socket recv buffer size (288000 bytes). > [TOTEM] Transmit multicast socket send buffer size (262142 bytes). > [TOTEM] The network interface [192.168.192.18] is now up. > [TOTEM] Created or loaded sequence id 3452.192.168.192.18 for this ring. > [TOTEM] entering GATHER state from 15. > aisexec died: Error, reason code is 4 Comparing with a normal startup, service registering is completely missing. There should be log lines like this, isn't it: [MAIN ] Using override node name saturn-gfs [MAIN ] openais component openais_cpg loaded. [MAIN ] Registering service handler 'openais cluster closed process group service v1.01' ... [MAIN ] openais component openais_cman loaded. 
[MAIN ] Registering service handler 'openais CMAN membership service 2.01' Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 49, Hungary From martin.fuerstenau at oce.com Tue Mar 17 10:04:42 2009 From: martin.fuerstenau at oce.com (Martin Fuerstenau) Date: Tue, 17 Mar 2009 11:04:42 +0100 Subject: [Linux-cluster] Post-Join Delay? In-Reply-To: <40aa95950903160943r2857aedfybd329a92c68ae4cd@mail.gmail.com> References: <40aa95950903160943r2857aedfybd329a92c68ae4cd@mail.gmail.com> Message-ID: <1237284282.25541.15.camel@lx002140.ops.de> Hi, this is from my cluster.conf: It works well for me. The problem with the defaults is that an upcoming node is fenced before it is completely up Yours Senior System Engineer Oc? Printing Systems GmbH On Mon, 2009-03-16 at 16:43 +0000, kali mero wrote: > Hi all, > > While I was reading the documentation I came across an awkward > statement: > > "The Post-Join Delay parameter is the number of seconds the fence > daemon (fenced) waits before fencing a node after the node joins the > fence domain. The Post-Join Delay default value is 3. A typical > setting for Post-Join Delay is between 20 and 30 seconds, but can vary > according to cluster and network performance." > > I usually leave this in the default value, but the docs refer to > something in between 20 and 30. > > So, what are your usually settings for this parameters? > What is your recommendation? > > TIA > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Martin F?rstenau Senior System Engineer Oc? Printing Systems GmbH Siemensallee 2 85586 Poing, Germany Direct Dial +49-(0)8121-72-4684 Direct Fax +49-(0)8121-72-4996 mailto: martin.fuerstenau at oce.com Executive Office: Andr? Mittelsteiner, Manfred Maier Registered Office: Poing, Landkreis Ebersberg ? Commercial Register M?nchen: HRB 113205 - WEEE-Reg.-Nr. DE 88805443 This message and attachment(s) are intended solely for use by the addressee and may contain information that is privileged, confidential or otherwise exempt from disclosure under applicable law. If you are not the intended recipient or agent thereof responsible for delivering this message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this communication in error, please notify the sender immediately by telephone and with a 'reply' message. Thank you for your co-operation. From kadlec at mail.kfki.hu Tue Mar 17 12:56:54 2009 From: kadlec at mail.kfki.hu (Kadlecsik Jozsef) Date: Tue, 17 Mar 2009 13:56:54 +0100 (CET) Subject: [Linux-cluster] www.openais.org disappeared: from where to donwload openais? In-Reply-To: References: <1236967303.28410.219.camel@sdake-laptop> Message-ID: On Tue, 17 Mar 2009, Kadlecsik Jozsef wrote: > > Starting /usr/sbin/aisexec aisexec -f > > CMAN_NODENAME=saturn-gfs > > CMAN_DEBUGLOG=255 > > OPENAIS_DEFAULT_CONFIG_IFACE=cmanconfig > > CMAN_PIPE=4 > > CC: setting logger levels > > forked process ID is 8264 > > [MAIN ] AIS Executive Service RELEASE 'subrev 1152 version 0.80' > > [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors. > > [MAIN ] Copyright (C) 2006 Red Hat, Inc. > > [MAIN ] AIS Executive Service: started and ready to provide service. 
> > [MAIN ] Using override node name saturn-gfs > > [TOTEM] Token Timeout (10000 ms) retransmit timeout (495 ms) > > [TOTEM] token hold (386 ms) retransmits before loss (20 retrans) > > [TOTEM] join (60 ms) send_join (0 ms) consensus (4800 ms) merge (200 ms) > > [TOTEM] downcheck (1000 ms) fail to recv const (50 msgs) > > [TOTEM] seqno unchanged const (30 rotations) Maximum network MTU 1500 > > [TOTEM] window size per rotation (50 messages) maximum messages per rotation (17 messages) > > [TOTEM] send threads (0 threads) > > [TOTEM] RRP token expired timeout (495 ms) > > [TOTEM] RRP token problem counter (2000 ms) > > [TOTEM] RRP threshold (10 problem count) > > [TOTEM] RRP mode set to none. > > [TOTEM] heartbeat_failures_allowed (0) > > [TOTEM] max_network_delay (50 ms) > > [TOTEM] HeartBeat is Disabled. To enable set heartbeat_failures_allowed > 0 > > [TOTEM] Receive multicast socket recv buffer size (288000 bytes). > > [TOTEM] Transmit multicast socket send buffer size (262142 bytes). > > [TOTEM] The network interface [192.168.192.18] is now up. > > [TOTEM] Created or loaded sequence id 3452.192.168.192.18 for this ring. > > [TOTEM] entering GATHER state from 15. > > aisexec died: Error, reason code is 4 > > Comparing with a normal startup, service registering is completely > missing. There should be log lines like this, isn't it: > > [MAIN ] Using override node name saturn-gfs > [MAIN ] openais component openais_cpg loaded. > [MAIN ] Registering service handler 'openais cluster closed process group service v1.01' > ... > [MAIN ] openais component openais_cman loaded. > [MAIN ] Registering service handler 'openais CMAN membership service 2.01' aisexec dies in the function openais_service_link_and_init. Inserting some log_printf line into exec/service.c like this if (!iface_ver0->openais_get_service_handler_ver0) log_printf(LOG_LEVEL_ERROR, "Missing service handle '%s'.\n", service /* * Initialize service */ service = iface_ver0->openais_get_service_handler_ver0(); I got in the debug log [MAIN ] Missing service handle 'openais_cman'. aisexec died with signal: 11 As if lcr_ifact_reference would return a broken object. Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 49, Hungary From vu at sivell.com Tue Mar 17 14:19:36 2009 From: vu at sivell.com (Vu Pham) Date: Tue, 17 Mar 2009 09:19:36 -0500 Subject: [Linux-cluster] lvmconf --enable-cluster Message-ID: <49BFB178.9040701@sivell.com> When using lvmconf --enable-cluster to enable LVM clustering, do I have to run it everytime the system reboots, or will it write down that enable status onto some config file and will restore that status over reboot ? If that status is recorded, then where is it ? Thanks, Vu From kadlec at mail.kfki.hu Tue Mar 17 14:48:09 2009 From: kadlec at mail.kfki.hu (Kadlecsik Jozsef) Date: Tue, 17 Mar 2009 15:48:09 +0100 (CET) Subject: [Linux-cluster] www.openais.org disappeared: from where to donwload openais? In-Reply-To: References: <1236967303.28410.219.camel@sdake-laptop> Message-ID: On Tue, 17 Mar 2009, Kadlecsik Jozsef wrote: > I got in the debug log > > [MAIN ] Missing service handle 'openais_cman'. > aisexec died with signal: 11 > > As if lcr_ifact_reference would return a broken object. Sorry for the noise, my fault. 
I forgot that the cluster package must be recompiled with the corresponding openais and I ran GFS compiled with openais-0.80.4, that was the reason of the failure. I can confirm that cluster-2.03.11 over openais rev 1754 from svn (without confdb) works fine with cluster-2.01.00 over openais-0.80.3 :-). Best regards, Jzosef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 49, Hungary From jaime.alonso.miguel at gmail.com Tue Mar 17 14:48:52 2009 From: jaime.alonso.miguel at gmail.com (Jaime Alonso) Date: Tue, 17 Mar 2009 15:48:52 +0100 Subject: [Linux-cluster] fencing error? Message-ID: <68ca69660903170748s189998f3t678295fc6ac504c@mail.gmail.com> Hi everybody I configured a two node cluster. Both servers are Hp so my fence devices are the ILOs of each server. Aparently I have configured correctly the cluster, but when i start the second node the node 1 reset automatically. Could it be because of the fence? What should I test to know where the error could be? Thanks in advance. -------------- next part -------------- An HTML attachment was scrubbed... URL: From agk at redhat.com Tue Mar 17 14:49:48 2009 From: agk at redhat.com (Alasdair G Kergon) Date: Tue, 17 Mar 2009 14:49:48 +0000 Subject: [Linux-cluster] lvmconf --enable-cluster In-Reply-To: <49BFB178.9040701@sivell.com> References: <49BFB178.9040701@sivell.com> Message-ID: <20090317144948.GJ3063@agk.fab.redhat.com> On Tue, Mar 17, 2009 at 09:19:36AM -0500, Vu Pham wrote: > If that status is recorded, then where is it ? /etc/lvm/lvm.conf Alasdair -- agk at redhat.com From erickson.jon at gmail.com Tue Mar 17 14:54:08 2009 From: erickson.jon at gmail.com (Jon Erickson) Date: Tue, 17 Mar 2009 10:54:08 -0400 Subject: [Linux-cluster] Strange directory listing In-Reply-To: <732180264.246891236362374011.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> References: <1635277376.246311236362110482.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> <732180264.246891236362374011.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <6a90e4da0903170754r32a703bbgf6b32f9858399376@mail.gmail.com> I'm seeing the same wierd listings....any word on what this is or when it will be fixed. I'm using RHEL 5.3 On Fri, Mar 6, 2009 at 1:59 PM, Bob Peterson wrote: > ----- "Jeff Sturm" wrote: > | > What level of GFS driver is this? ?Are you up2date or running > | > a recent level? > | > | We aren't running Red Hat. ?We have: > | > | CentOS release 5.2 (Final) > | kmod-gfs-0.1.23-5.el5 > > Close enough. ?:7) ?That's a bit old, so it's possible it's the > problem I pointed out. ?Sounds like it's not hurting you anyway. > So hopefully this problem will go away the next time you update, > to Centos5.3 or whatever. > > | The rogue files don't stay around long enough to stat() on all nodes. > | My assumption is that these are files just created or in the process > | of > | being destroyed, and I don't see them in two successive "ls -l" > | commands. > | > | I wasn't too concerned since this doesn't seem to have any negative > | impact on the application, but was curious nonetheless. 
> > Regards, > > Bob Peterson > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Jon From theophanis_kontogiannis at yahoo.gr Tue Mar 17 14:31:11 2009 From: theophanis_kontogiannis at yahoo.gr (Theophanis Kontogiannis) Date: Tue, 17 Mar 2009 16:31:11 +0200 Subject: [Linux-cluster] 'ls' makes GFS2 to withdraw In-Reply-To: <1237217055.9571.927.camel@quoit> References: <008d01c9a64b$2ddbb020$89931060$@gr> <1237217055.9571.927.camel@quoit> Message-ID: <00fc01c9a70d$063a0950$12ae1bf0$@gr> Hello Steve and All, Running gfs2_fsck gave thousands and thousands of the following errors. However now the files that looked corrupted on the 'ls' output are presented with correct properties and file details (size, date etc...). Also the file system looks stable without errors. ....................... Ondisk and fsck bitmaps differ at block 86432027 (0x526d91b) Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free) Metadata type is 0 (free) Succeeded. Ondisk and fsck bitmaps differ at block 86432028 (0x526d91c) Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free) Metadata type is 0 (free) Succeeded. Ondisk and fsck bitmaps differ at block 86432029 (0x526d91d) Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free) Metadata type is 0 (free) Succeeded. Ondisk and fsck bitmaps differ at block 86432030 (0x526d91e) Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free) Metadata type is 0 (free) Succeeded. Ondisk and fsck bitmaps differ at block 86440684 (0x526faec) Ondisk status is 2 (Invalid) but FSCK thinks it should be 0 (Free) Metadata type is 0 (free) Succeeded. RG #86376522 (0x526004a) free count inconsistent: is 45385 should be 48898 Inode count inconsistent: is 3 should be 1 Resource group counts updated Pass5 complete Writing changes to disk gfs2_fsck complete Well I guess the root cause of the corruption was that the systems were fencing each other (aka power cycling) with the file system mounted and some services already started (maybe due to in cluster.conf?). It took me sometime to figure out how to debug the cluster, without having services starting (including the mounting of the file system). Thank you all for your time. Theophanis Kontogiannis > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster- > bounces at redhat.com] On Behalf Of Steven Whitehouse > Sent: Monday, March 16, 2009 5:24 PM > To: linux clustering > Subject: Re: [Linux-cluster] 'ls' makes GFS2 to withdraw > > Hi, > > Please do not use GFS2 on Centos 5.2, it is rather old. Did you try > running fsck.gfs2 ? > > The results you see look like the readdir() call has worked, but that > the stat() call to the directory entry has failed. I'd suggest using > Fedora at least until Centos 5.3 is available, > > Steve. > > On Mon, 2009-03-16 at 17:23 +0200, Theophanis Kontogiannis wrote: > > Hello all, > > > > > > > > I have Centos 5.2, kernel 2.6.18-92.1.22.el5.centos.plus, > > gfs2-utils-0.1.44-1.el5_2.1 > > > > > > > > The cluster is two nodes, using DRBD 8.3.2 as the shared block device, > > and CLVM over it, and GFS2 over it. > > > > > > > > After an ls in a directory within the GFS2 file system I got the > > following errors. > > > > > > > > ??????? 
> > > > GFS2: fsid=tweety:gfs2-00.0: fatal: invalid metadata block > > > > GFS2: fsid=tweety:gfs2-00.0: bh = 522538 (magic number) > > > > GFS2: fsid=tweety:gfs2-00.0: function = gfs2_meta_indirect_buffer, > > file = fs/gfs2/meta_io.c, line = 332 > > > > GFS2: fsid=tweety:gfs2-00.0: about to withdraw this file system > > > > GFS2: fsid=tweety:gfs2-00.0: telling LM to withdraw > > > > GFS2: fsid=tweety:gfs2-00.0: withdrawn > > > > > > > > Call Trace: > > > > [] :gfs2:gfs2_lm_withdraw+0xc1/0xd0 > > > > [] __wait_on_bit+0x60/0x6e > > > > [] sync_buffer+0x0/0x3f > > > > [] out_of_line_wait_on_bit+0x6c/0x78 > > > > [] wake_bit_function+0x0/0x23 > > > > [] :gfs2:gfs2_meta_check_ii+0x2c/0x38 > > > > [] :gfs2:gfs2_meta_indirect_buffer+0x104/0x15e > > > > [] :gfs2:gfs2_inode_refresh+0x22/0x2ca > > > > [] wake_bit_function+0x0/0x23 > > > > [] :gfs2:inode_go_lock+0x29/0x57 > > > > [] :gfs2:glock_wait_internal+0x1d4/0x23f > > > > [] :gfs2:gfs2_glock_nq+0x1ae/0x1d4 > > > > [] :gfs2:gfs2_lookup+0x58/0xa7 > > > > [] :gfs2:gfs2_lookup+0x50/0xa7 > > > > [] d_alloc+0x174/0x1a9 > > > > [] do_lookup+0xe5/0x1e6 > > > > [] __link_path_walk+0xa01/0xf42 > > > > [] zone_statistics+0x3e/0x6d > > > > [] link_path_walk+0x5c/0xe5 > > > > [] :gfs2:gfs2_glock_put+0x26/0x133 > > > > [] do_path_lookup+0x270/0x2e8 > > > > [] getname+0x15b/0x1c1 > > > > [] __user_walk_fd+0x37/0x4c > > > > [] vfs_lstat_fd+0x18/0x47 > > > > [] sys_newlstat+0x19/0x31 > > > > [] tracesys+0x71/0xe0 > > > > [] tracesys+0xd5/0xe0 > > > > ????????. > > > > > > > > > > > > Obviously ls was not the cause of the problem but it triggered the > > events. > > > > > > > > From the other node I can have access on the directory that on which > > the ?ls? triggered the above. The directory is full of files like > > that: > > > > > > > > ?--------- ? ? ? ? ? sched_reply > > > > > > > > Almost 50% of the files are in shown like that with ls. > > > > > > > > The questions are: > > > > > > > > 1. Is this a (new) GFS2 bug? > > > > 2. Is this a recoverable problem (and how)? > > > > 3. After a GFS2 file system gets withdrawn, how do we make the > > node to use it again, without rebooting? > > > > > > > > Thank you all for your time. > > > > > > > > Theophanis Kontogiannis > > > > > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From gordan at bobich.net Tue Mar 17 15:02:10 2009 From: gordan at bobich.net (Gordan Bobic) Date: Tue, 17 Mar 2009 15:02:10 +0000 Subject: [Linux-cluster] Strange directory listing In-Reply-To: <6a90e4da0903170754r32a703bbgf6b32f9858399376@mail.gmail.com> References: <1635277376.246311236362110482.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> <732180264.246891236362374011.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> <6a90e4da0903170754r32a703bbgf6b32f9858399376@mail.gmail.com> Message-ID: <4ec79e7c0b9e28d71982d04b978ba771@localhost> Is that to say that GFS2 is still, 18 months later, not production stable? On Tue, 17 Mar 2009 10:54:08 -0400, Jon Erickson wrote: > I'm seeing the same wierd listings....any word on what this is or when > it will be fixed. I'm using RHEL 5.3 > > On Fri, Mar 6, 2009 at 1:59 PM, Bob Peterson wrote: >> ----- "Jeff Sturm" wrote: >> | > What level of GFS driver is this? ?Are you up2date or running >> | > a recent level? >> | >> | We aren't running Red Hat. 
?We have: >> | >> | CentOS release 5.2 (Final) >> | kmod-gfs-0.1.23-5.el5 >> >> Close enough. ?:7) ?That's a bit old, so it's possible it's the >> problem I pointed out. ?Sounds like it's not hurting you anyway. >> So hopefully this problem will go away the next time you update, >> to Centos5.3 or whatever. >> >> | The rogue files don't stay around long enough to stat() on all nodes. >> | My assumption is that these are files just created or in the process >> | of >> | being destroyed, and I don't see them in two successive "ls -l" >> | commands. >> | >> | I wasn't too concerned since this doesn't seem to have any negative >> | impact on the application, but was curious nonetheless. >> >> Regards, >> >> Bob Peterson >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> From vu at sivell.com Tue Mar 17 15:06:40 2009 From: vu at sivell.com (Vu Pham) Date: Tue, 17 Mar 2009 10:06:40 -0500 Subject: [Linux-cluster] lvmconf --enable-cluster In-Reply-To: <20090317144948.GJ3063@agk.fab.redhat.com> References: <49BFB178.9040701@sivell.com> <20090317144948.GJ3063@agk.fab.redhat.com> Message-ID: <49BFBC80.6060202@sivell.com> Alasdair G Kergon wrote: > On Tue, Mar 17, 2009 at 09:19:36AM -0500, Vu Pham wrote: >> If that status is recorded, then where is it ? > > /etc/lvm/lvm.conf > Thanks, Alasdair. [root at vm1 lvm]# diff lvm.conf lvm.conf.sav 237c237 < locking_type = 1 --- > locking_type = 3 [root at vm1 lvm]# I did check this before my post but could not see the difference because this vm1 machine already has this command run on it already :) Thanks again. Vu From sdake at redhat.com Tue Mar 17 17:40:16 2009 From: sdake at redhat.com (Steven Dake) Date: Tue, 17 Mar 2009 10:40:16 -0700 Subject: [Linux-cluster] www.openais.org disappeared: from where to donwload openais? In-Reply-To: References: <1236967303.28410.219.camel@sdake-laptop> Message-ID: <1237311616.6053.1.camel@sdake-laptop> On Tue, 2009-03-17 at 15:48 +0100, Kadlecsik Jozsef wrote: > On Tue, 17 Mar 2009, Kadlecsik Jozsef wrote: > > > I got in the debug log > > > > [MAIN ] Missing service handle 'openais_cman'. > > aisexec died with signal: 11 > > > > As if lcr_ifact_reference would return a broken object. > > Sorry for the noise, my fault. I forgot that the cluster package must be > recompiled with the corresponding openais and I ran GFS compiled with > openais-0.80.4, that was the reason of the failure. > > I can confirm that cluster-2.03.11 over openais rev 1754 from svn (without > confdb) works fine with cluster-2.01.00 over openais-0.80.3 :-). > > Best regards, > Jzosef Thanks for the confirmation. We will definitely be working on sorting the confdb breakage out. Regards -steve > -- > E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu > PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt > Address: KFKI Research Institute for Particle and Nuclear Physics > H-1525 Budapest 114, POB. 49, Hungary > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From virginian at blueyonder.co.uk Tue Mar 17 17:56:42 2009 From: virginian at blueyonder.co.uk (Virginian) Date: Tue, 17 Mar 2009 17:56:42 -0000 Subject: [Linux-cluster] fencing error? 
References: <68ca69660903170748s189998f3t678295fc6ac504c@mail.gmail.com> Message-ID: Please post your cluster.conf ----- Original Message ----- From: Jaime Alonso To: linux clustering Sent: Tuesday, March 17, 2009 2:48 PM Subject: [Linux-cluster] fencing error? Hi everybody I configured a two node cluster. Both servers are Hp so my fence devices are the ILOs of each server. Aparently I have configured correctly the cluster, but when i start the second node the node 1 reset automatically. Could it be because of the fence? What should I test to know where the error could be? Thanks in advance. ------------------------------------------------------------------------------ -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From bmarzins at redhat.com Tue Mar 17 18:27:20 2009 From: bmarzins at redhat.com (Benjamin Marzinski) Date: Tue, 17 Mar 2009 13:27:20 -0500 Subject: [Linux-cluster] can't import gnbd ? In-Reply-To: <014201c9a1f3$5f735a90$1e5a0fb0$@net> References: <025701c9a161$8575f690$9061e3b0$@net> <20090310172649.GI32340@ether.msp.redhat.com> <014201c9a1f3$5f735a90$1e5a0fb0$@net> Message-ID: <20090317182719.GN32340@ether.msp.redhat.com> On Wed, Mar 11, 2009 at 10:45:00AM +0800, Figaro Yang wrote: > Dear Ben : > > Thank you for your kindly reply. I still have another question as follows: > > the host client is one of GNBD client in my cluster, does it need to start > cman service at client host? > The host client is not suppose to be included in /etc/cluster/cluster.conf. Both the GNBD clients and the servers need to be included in /etc/cluster/cluster.conf, and have cman running on them. > > My cluster architecture is as below: > > Hosta ( GFS / GNBD Server ) ---------| > Hostb ( GFS / GNBD Server ) ---------| ------------ SAN Storage > Hostc ( GFS / GNBD Server ) ---------| > | > | > | > --------------- client ( GNBD Client ) > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Benjamin Marzinski > Sent: Wednesday, March 11, 2009 1:27 AM > To: linux clustering > Subject: Re: [Linux-cluster] can't import gnbd ? > > On Tue, Mar 10, 2009 at 05:20:57PM +0800, Figaro Yang wrote: > > Hi , ALL .. > > > > > > > > I set up a File Cluster consist of 3 GNBD Server , using gnbd server > for > > fencing , in case , when I on client node to gnbd_import , system have > > return some error message .. how can I solve it ? thank all .. > > > > > > > > [root at client ~]# modprobe gnbd > > > > [root at client ~]# gnbd_import -i io3 > > > > gnbd_import: ERROR cannot get node name : No such file or directory > > > > gnbd_import: ERROR If you are not planning to use a cluster manager, > use > > -n > > Did you start up the cman first? > > Unless you are planning to use gnbd outside of a cluster (and if you > want fencing, you need to use run gnbd in a cluster), you need to start > cman first. gnbd_import failed when it called cman_init(). The most > likely reason for this is that your cluster is not started yet. > > -Ben > > > > > [root at client ~]# tail /var/log/messages > > > > Mar 10 04:44:54 client smartd[3719]: Device: /dev/sda, is SMART > capable. > > Adding to "monitor" list. > > > > Mar 10 04:44:54 client smartd[3719]: Monitoring 1 ATA and 0 SCSI > devices > > > > Mar 10 04:44:55 client smartd[3721]: smartd has fork()ed into > background > > mode. New PID=3721. 
> > > > Mar 10 04:44:55 client pcscd: winscard.c:304:SCardConnect() Reader > E-Gate > > 0 0 Not Found > > > > Mar 10 04:44:55 client last message repeated 3 times > > > > Mar 10 04:44:56 client kernel: mtrr: type mismatch for d8000000,2000000 > > old: uncachable new: write-combining > > > > Mar 10 05:15:15 client gnbd_import: ERROR [../../utils/gnbd_utils.c:78] > > cman_init failed : No such file or directory > > > > Mar 10 05:15:17 client last message repeated 2 times > > > > Mar 10 05:15:21 client kernel: gnbd: registered device at major 252 > > > > Mar 10 05:15:24 client gnbd_import: ERROR [../../utils/gnbd_utils.c:78] > > cman_init failed : No such file or directory > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From nehemiasjahcob at gmail.com Tue Mar 17 19:45:56 2009 From: nehemiasjahcob at gmail.com (Nehemias Jahcob) Date: Tue, 17 Mar 2009 15:45:56 -0400 Subject: [Linux-cluster] fencing error? In-Reply-To: References: <68ca69660903170748s189998f3t678295fc6ac504c@mail.gmail.com> Message-ID: <5f61ab380903171245n4b11231eo428ecd984c925370@mail.gmail.com> Tu problema es que estas cruzando la ILO, cada ILO debe ser el fence de cada maquina.. Algo asi. Cluster.conf.. ######################################################################## ########################################################################## Saludos. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jds at techma.com Tue Mar 17 22:03:12 2009 From: jds at techma.com (Simmons, Dan A) Date: Tue, 17 Mar 2009 18:03:12 -0400 Subject: [Linux-cluster] What does this mean? "Resource groups locked, not evaluating" In-Reply-To: References: Message-ID: <35440D75A88EF04C81FDEC03F4F739F7028A1B@tmaemail.techma.com> Hi All, What does this mean? "Resource groups locked, not evaluating" I see it in my logs for a RHEL 4u7 after it boots up. Functionally everything seems to be working but I don't recall seeing that warning before. Thanks, JDan From janne.peltonen at helsinki.fi Wed Mar 18 06:35:19 2009 From: janne.peltonen at helsinki.fi (Janne Peltonen) Date: Wed, 18 Mar 2009 08:35:19 +0200 Subject: [Linux-cluster] IP monitor_link timeout Message-ID: <20090318063519.GC25274@helsinki.fi> Hi! I was wondering if there be a way to increase the timeout of the IP address link monitor check. In my cluster setup, it takes abt 30 seconds to relocate a failed service to another node (or, alternatively, restart it on the same node); now, if a network link is down for, say, 25 seconds, it doesn't make much sense to relocate the service, especially as I'll have to find a time to migrate the service back to the original node. I could disable the monitor_link option completely, I guess, but if the network interface on one node really fails, it'd be nice to have the service relocate automatically to another node. (The node won't necessarily be kicked from the cluster at such a situation, since the cluster token traffic is via another network interface.) Thanks for any advice. 
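(For reference on the monitor_link question just above: rgmanager's ip resource accepts a per-address monitor_link flag, so link checking can be turned off for one service without disabling it cluster-wide. The service and address below are invented placeholders, not taken from the poster's setup; this is only a minimal sketch:

    <service autostart="1" name="example-svc">
        <!-- keep the address, but skip the ethernet link check for it -->
        <ip address="192.168.1.10" monitor_link="0"/>
    </service>

I am not aware of a tunable in the RHEL 5.2/5.3-era ip.sh agent for lengthening the link-check timeout itself, so the practical choices seem to be disabling monitor_link on that one address or accepting the occasional relocation.)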
Janne Peltonen IT dept University of Helsinki -- Janne Peltonen PGP Key ID: 0x9CFAC88B Please consider membership of the Hospitality Club (http://www.hospitalityclub.org) From mpartio at gmail.com Wed Mar 18 07:42:41 2009 From: mpartio at gmail.com (Mikko Partio) Date: Wed, 18 Mar 2009 09:42:41 +0200 Subject: [Linux-cluster] Problems with cluster (fencing?) Message-ID: <2ca799770903180042o445d5f34hf4ec1bd6bbb16b1a@mail.gmail.com> Hello all I have a two-node cluster with a quorum disk. When I pull off the power cord from one node, the other node freezes the shared gfs-volumes and all activity stops, even though the cluster maintains quorum. When the other node boots up, I can see that "starting fencing" takes many minutes and afterwards starting clvmd fails. That node therefore cannot mount gfs disks since the underlying lvm volumes are missing. Also, if I shut down both nodes and start just one of them, the starting node still waits in the "starting fencing" part many minutes even though the cluster should be quorate (there's a quorum disk)! Fencing method used is HP iLO 2. I don't remember seeing this in CentOS 5.1 (now running 5.2). Any clue what might cause this? Regards Mikko -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlopmart at gmail.com Wed Mar 18 12:24:22 2009 From: carlopmart at gmail.com (carlopmart) Date: Wed, 18 Mar 2009 13:24:22 +0100 Subject: [Linux-cluster] action status doesn't works with rhel 5.3? Message-ID: <49C0E7F6.100@gmail.com> Hi all, i have setup a cluster with two nodes and I need to modify when the service's status is checked. To do this I have put this on cluster.conf: But status service is checked every 30 seconds ... What am I doing wrong?? Many thanks. -- CL Martinez carlopmart {at} gmail {d0t} com From vu at sivell.com Wed Mar 18 17:30:50 2009 From: vu at sivell.com (vu pham) Date: Wed, 18 Mar 2009 11:30:50 -0600 Subject: [Linux-cluster] cannot add journal to gfs Message-ID: <49C12FCA.7050904@sivell.com> Although my gfs partition has a lot of free space to add two more journal, but gfs_jadd complains of not enough space. Do I have to run any extra command to make it work ? [root at vm1 gfsdata]# gfs_tool df /mnt/gfsdata /mnt/gfsdata: SB lock proto = "lock_dlm" SB lock table = "cluster1:gfs2" SB ondisk format = 1309 SB multihost format = 1401 Block size = 4096 Journals = 2 Resource Groups = 10 Mounted lock proto = "lock_dlm" Mounted lock table = "cluster1:gfs2" Mounted host data = "jid=0:id=262146:first=1" Journal number = 0 Lock module flags = 0 Local flocks = FALSE Local caching = FALSE Oopses OK = FALSE Type Total Used Free use% ------------------------------------------------------------------------ inodes 6 6 0 100% metadata 63 0 63 0% data 383867 0 383867 0% [root at vm1 gfsdata]# gfs_jadd -Tv -j 2 /mnt/gfsdata Requested size (65536 blocks) greater than available space (2 blocks) [root at vm1 gfsdata]# Thanks, Vu From vu at sivell.com Wed Mar 18 17:44:43 2009 From: vu at sivell.com (vu pham) Date: Wed, 18 Mar 2009 11:44:43 -0600 Subject: [Linux-cluster] Re: cannot add journal to gfs In-Reply-To: <49C12FCA.7050904@sivell.com> References: <49C12FCA.7050904@sivell.com> Message-ID: <49C1330B.3050609@sivell.com> Oh, just ignore this stupid question. I didn't read the man page carefully. This sentence from the man page solves the problem: "gfs_jadd will not use space that has been formatted for filesystem data even if that space has never been populated with files." I extended the lvm and then grew the gfs. 
I should not grow it so that I have space for journals. Sorry to post questions before checking the problem carefully. vu pham wrote: > Although my gfs partition has a lot of free space to add two more > journal, but gfs_jadd complains of not enough space. > > Do I have to run any extra command to make it work ? > > > > [root at vm1 gfsdata]# gfs_tool df /mnt/gfsdata > /mnt/gfsdata: > SB lock proto = "lock_dlm" > SB lock table = "cluster1:gfs2" > SB ondisk format = 1309 > SB multihost format = 1401 > Block size = 4096 > Journals = 2 > Resource Groups = 10 > Mounted lock proto = "lock_dlm" > Mounted lock table = "cluster1:gfs2" > Mounted host data = "jid=0:id=262146:first=1" > Journal number = 0 > Lock module flags = 0 > Local flocks = FALSE > Local caching = FALSE > Oopses OK = FALSE > > Type Total Used Free use% > ------------------------------------------------------------------------ > inodes 6 6 0 100% > metadata 63 0 63 0% > data 383867 0 383867 0% > > [root at vm1 gfsdata]# gfs_jadd -Tv -j 2 /mnt/gfsdata > Requested size (65536 blocks) greater than available space (2 blocks) > [root at vm1 gfsdata]# > > > > Thanks, > Vu > From rpeterso at redhat.com Wed Mar 18 17:21:36 2009 From: rpeterso at redhat.com (Bob Peterson) Date: Wed, 18 Mar 2009 13:21:36 -0400 (EDT) Subject: [Linux-cluster] cannot add journal to gfs In-Reply-To: <1061745359.1024731237396884517.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <722087759.1024781237396896773.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- "vu pham" wrote: | Although my gfs partition has a lot of free space to add two more | journal, but gfs_jadd complains of not enough space. | | Do I have to run any extra command to make it work ? Hi, This is a common complaint about gfs. Frankly, I'm surprised that it's not in the FAQ (I should add it). GFS will only let you add a journal if you have space on your DEVICE that's not already dedicated to the file system. In other words, you have to add storage to the volume, then do gfs_jadd BEFORE doing gfs_grow. If you use gfs_grow after adding storage, it will allocate all the free space to the file system and therefore has no room for any new journals. This problem was addressed in GFS2 where the journals are actually part of the file system. So in GFS2 you can add journals dynamically. In GFS, you have to add storage, resize your lv, then gfs_jadd. Regards, Bob Peterson Red Hat GFS From erickson.jon at gmail.com Wed Mar 18 18:22:01 2009 From: erickson.jon at gmail.com (Jon Erickson) Date: Wed, 18 Mar 2009 14:22:01 -0400 Subject: [Linux-cluster] Fence Device Question Message-ID: <6a90e4da0903181122s13fd4eb5ic3e9383d44da4326@mail.gmail.com> All, I currently use the fence_mcdata script (with slight mod) to provide fencing to my DS-4700M switch. I have two questions: 1. The username and password are stored plain text within the cluster.conf file. Is there a way to make this more secure? (password script?) 2. fence_mcdata works by making a telnet connection to my switch, this is also plain text. I know the switch can support SSH. Does anyone have any expirence using SSH to log into a switch to block ports? Is there a fence_mcdata_ssh script :). Thanks in advance. 
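(A quick sketch of the gfs_jadd ordering Bob describes above, since it trips people up: extend the device first, add the journals while the new blocks are still outside the filesystem, and only then run gfs_grow if you also want more data space. The volume group and logical volume names below are made up; the mount point and journal count are the ones from the thread:

    # 1. add space to the logical volume backing the GFS filesystem
    lvextend -L +1G /dev/vg_cluster/lv_gfsdata
    # 2. add the journals while that space is still unformatted
    gfs_jadd -j 2 /mnt/gfsdata
    # 3. optionally hand whatever is left over to the filesystem itself
    gfs_grow /mnt/gfsdata

Running gfs_grow before gfs_jadd gives all of the new space to the data area, which is exactly the "greater than available space" failure shown earlier in the thread.)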
-Jon From vu at sivell.com Wed Mar 18 19:54:53 2009 From: vu at sivell.com (vu pham) Date: Wed, 18 Mar 2009 13:54:53 -0600 Subject: [Linux-cluster] cannot add journal to gfs In-Reply-To: <722087759.1024781237396896773.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> References: <722087759.1024781237396896773.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <49C1518D.5080909@sivell.com> Bob Peterson wrote: > ----- "vu pham" wrote: > | Although my gfs partition has a lot of free space to add two more > | journal, but gfs_jadd complains of not enough space. > | > | Do I have to run any extra command to make it work ? > > Hi, > > This is a common complaint about gfs. Frankly, I'm surprised > that it's not in the FAQ (I should add it). GFS will only let you > add a journal if you have space on your DEVICE that's not already > dedicated to the file system. In other words, you have to add > storage to the volume, then do gfs_jadd BEFORE doing gfs_grow. > If you use gfs_grow after adding storage, it will allocate all the > free space to the file system and therefore has no room for any > new journals. > > This problem was addressed in GFS2 where the journals are actually > part of the file system. So in GFS2 you can add journals dynamically. > In GFS, you have to add storage, resize your lv, then gfs_jadd. > Thanks, Bob. It works as expected now. Vu From moya at latertulia.org Wed Mar 18 18:59:21 2009 From: moya at latertulia.org (Maykel Moya) Date: Wed, 18 Mar 2009 14:59:21 -0400 Subject: [Linux-cluster] Service is not migrated when owner fails Message-ID: <3b81bca30903181159v703ac47cj3a17feff16e2407@mail.gmail.com> I have a 4-node cluster, each node running one of four services. Each service is an ip/fs combination. I'm trying to test service failover. After disconnecting the network to one of the nodes (ip link set eth0 down), its running service is not migrated to another node until the node get successfully fenced. I tried to add 'recovery="relocated"' to declaration but in that case the service is relocated when the failing node is back online after a succesful fence. I'd like the service being migrated to another node as soon as a fail in the current owner node gets detected. Regards, maykel ==== .... .... From Gary_Hunt at gallup.com Wed Mar 18 20:47:25 2009 From: Gary_Hunt at gallup.com (Hunt, Gary) Date: Wed, 18 Mar 2009 15:47:25 -0500 Subject: [Linux-cluster] Problems with cluster (fencing?) In-Reply-To: <2ca799770903180042o445d5f34hf4ec1bd6bbb16b1a@mail.gmail.com> References: <2ca799770903180042o445d5f34hf4ec1bd6bbb16b1a@mail.gmail.com> Message-ID: I was fighting a very similar issue today. I am not familiar with the fencing you are using, but I would guess your fence device is not working properly. If a node fails and the fencing doesn't succeed it will halt all gfs activity. If a clustat shows both nodes and the quorum disk online, but no rgmanager try running a fence_tool leave and fence_tool join on both nodes. That worked for me today. Starting one node with the other node down is failing because it is trying to fence all nodes not present before proceeding. I am testing clean_start="1" in the cluster.conf. It has worked well so far. I would definitely read the man page for fenced about clean_start before using it. It does have some risks. 
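(To make the two suggestions above concrete, and with the caveat already given about clean_start: the values below are illustrative only, not a recommendation for any particular cluster.

    # re-join the fence domain on a node whose fencing state is wedged
    # (typically with rgmanager stopped or idle on that node)
    fence_tool leave
    fence_tool join

and, in cluster.conf, a fence_daemon line that skips startup fencing of absent nodes and uses the longer post-join delay discussed in the Post-Join Delay thread earlier:

    <fence_daemon clean_start="1" post_join_delay="20" post_fail_delay="0"/>

clean_start="1" tells fenced to assume missing nodes are safely down at startup, so it trades away protection against a genuinely hung node; read fenced(8) before enabling it.)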
Gary From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Mikko Partio Sent: Wednesday, March 18, 2009 2:43 AM To: linux clustering Subject: [Linux-cluster] Problems with cluster (fencing?) Hello all I have a two-node cluster with a quorum disk. When I pull off the power cord from one node, the other node freezes the shared gfs-volumes and all activity stops, even though the cluster maintains quorum. When the other node boots up, I can see that "starting fencing" takes many minutes and afterwards starting clvmd fails. That node therefore cannot mount gfs disks since the underlying lvm volumes are missing. Also, if I shut down both nodes and start just one of them, the starting node still waits in the "starting fencing" part many minutes even though the cluster should be quorate (there's a quorum disk)! Fencing method used is HP iLO 2. I don't remember seeing this in CentOS 5.1 (now running 5.2). Any clue what might cause this? Regards Mikko ________________________________ IMPORTANT NOTICE: This e-mail message and all attachments, if any, may contain confidential and privileged material and are intended only for the person or entity to which the message is addressed. If you are not an intended recipient, you are hereby notified that any use, dissemination, distribution, disclosure, or copying of this information is unauthorized and strictly prohibited. If you have received this communication in error, please contact the sender immediately by reply e-mail, and destroy all copies of the original message. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Gary_Hunt at gallup.com Wed Mar 18 21:23:20 2009 From: Gary_Hunt at gallup.com (Hunt, Gary) Date: Wed, 18 Mar 2009 16:23:20 -0500 Subject: [Linux-cluster] quorum disk votes Message-ID: Is there a way to get a cluster node to recognize that the number of votes a quorum disk gets has changed? I added a new node to the cluster and updated the cluster.conf to reflect the changes and propagated it. In this case I went from 3 total votes and a quorum disk vote of 1 to 5 total votes and quorum disk votes of 2. A cman_tool status showed that the total expected votes went to 5, but the quorum device votes stayed at 1. A reboot of the node fixes this, but would like to do this without risking disruption Thanks Gary ________________________________ IMPORTANT NOTICE: This e-mail message and all attachments, if any, may contain confidential and privileged material and are intended only for the person or entity to which the message is addressed. If you are not an intended recipient, you are hereby notified that any use, dissemination, distribution, disclosure, or copying of this information is unauthorized and strictly prohibited. If you have received this communication in error, please contact the sender immediately by reply e-mail, and destroy all copies of the original message. -------------- next part -------------- An HTML attachment was scrubbed... URL: From vu at sivell.com Thu Mar 19 00:14:20 2009 From: vu at sivell.com (Vu Pham) Date: Wed, 18 Mar 2009 19:14:20 -0500 Subject: [Linux-cluster] quorum disk votes In-Reply-To: References: Message-ID: <49C18E5C.8090307@sivell.com> Hunt, Gary wrote: > Is there a way to get a cluster node to recognize that the number of > votes a quorum disk gets has changed? I added a new node to the cluster > and updated the cluster.conf to reflect the changes and propagated it. 
> In this case I went from 3 total votes and a quorum disk vote of 1 to 5 > total votes and quorum disk votes of 2. > > > > A cman_tool status showed that the total expected votes went to 5, but > the quorum device votes stayed at 1. A reboot of the node fixes this, > but would like to do this without risking disruption > > Have you tried "service qdiskd restart" ? Vu From Harri.Paivaniemi at tieto.com Thu Mar 19 05:31:54 2009 From: Harri.Paivaniemi at tieto.com (Harri.Paivaniemi at tieto.com) Date: Thu, 19 Mar 2009 07:31:54 +0200 Subject: [Linux-cluster] Problems with cluster (fencing?) References: <2ca799770903180042o445d5f34hf4ec1bd6bbb16b1a@mail.gmail.com> Message-ID: <41E8D4F07FCE154CBEBAA60FFC92F67754D4CE@apollo.eu.tieto.com> Nothing to say for the first part, but this: ""Also, if I shut down both nodes and start just one of them, the starting node still waits in the "starting fencing" part many minutes even though the cluster should be quorate (there's a quorum disk)! "" I had a similar situation and the reason why the first node couldn't get up alone was, that cman was starting before qdiskd and so it didn't see quorum disk votes and was not quorate at the moment. I changed those (boot order) vice-versa and immediately node boots up ok and is up'n running... -hjp -----Original Message----- From: linux-cluster-bounces at redhat.com on behalf of Hunt, Gary Sent: Wed 3/18/2009 22:47 To: linux clustering Subject: RE: [Linux-cluster] Problems with cluster (fencing?) I was fighting a very similar issue today. I am not familiar with the fencing you are using, but I would guess your fence device is not working properly. If a node fails and the fencing doesn't succeed it will halt all gfs activity. If a clustat shows both nodes and the quorum disk online, but no rgmanager try running a fence_tool leave and fence_tool join on both nodes. That worked for me today. Starting one node with the other node down is failing because it is trying to fence all nodes not present before proceeding. I am testing clean_start="1" in the cluster.conf. It has worked well so far. I would definitely read the man page for fenced about clean_start before using it. It does have some risks. Gary From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Mikko Partio Sent: Wednesday, March 18, 2009 2:43 AM To: linux clustering Subject: [Linux-cluster] Problems with cluster (fencing?) Hello all I have a two-node cluster with a quorum disk. When I pull off the power cord from one node, the other node freezes the shared gfs-volumes and all activity stops, even though the cluster maintains quorum. When the other node boots up, I can see that "starting fencing" takes many minutes and afterwards starting clvmd fails. That node therefore cannot mount gfs disks since the underlying lvm volumes are missing. Also, if I shut down both nodes and start just one of them, the starting node still waits in the "starting fencing" part many minutes even though the cluster should be quorate (there's a quorum disk)! Fencing method used is HP iLO 2. I don't remember seeing this in CentOS 5.1 (now running 5.2). Any clue what might cause this? Regards Mikko ________________________________ IMPORTANT NOTICE: This e-mail message and all attachments, if any, may contain confidential and privileged material and are intended only for the person or entity to which the message is addressed. 
If you are not an intended recipient, you are hereby notified that any use, dissemination, distribution, disclosure, or copying of this information is unauthorized and strictly prohibited. If you have received this communication in error, please contact the sender immediately by reply e-mail, and destroy all copies of the original message. -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 4269 bytes Desc: not available URL: From corey.kovacs at gmail.com Thu Mar 19 05:37:48 2009 From: corey.kovacs at gmail.com (Corey Kovacs) Date: Thu, 19 Mar 2009 05:37:48 +0000 Subject: [Linux-cluster] What does this mean? "Resource groups locked, not evaluating" In-Reply-To: <35440D75A88EF04C81FDEC03F4F739F7028A1B@tmaemail.techma.com> References: <35440D75A88EF04C81FDEC03F4F739F7028A1B@tmaemail.techma.com> Message-ID: <7d6e8da40903182237i8b1ed03m4c68e5b59e724e51@mail.gmail.com> Dan, a quick google reveals the patch from whence this log originates and appears to have replaced a message that read "Services locked". If you are seeing these only after bootup, I suspect that it's just a condition of the cluster booting up and rgmanager not allowing things to flip from node to node during the boot process until things settle down. Lon, you seem to have written the patch, can you confirm this? Corey On Tue, Mar 17, 2009 at 10:03 PM, Simmons, Dan A wrote: > Hi All, > > What does this mean? ?"Resource groups locked, not evaluating" > > I see it in my logs for a RHEL 4u7 after it boots up. ?Functionally > everything seems to be working but I don't recall seeing that warning > before. > > Thanks, > JDan > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From mpartio at gmail.com Thu Mar 19 06:55:57 2009 From: mpartio at gmail.com (Mikko Partio) Date: Thu, 19 Mar 2009 08:55:57 +0200 Subject: [Linux-cluster] Problems with cluster (fencing?) In-Reply-To: <41E8D4F07FCE154CBEBAA60FFC92F67754D4CE@apollo.eu.tieto.com> References: <2ca799770903180042o445d5f34hf4ec1bd6bbb16b1a@mail.gmail.com> <41E8D4F07FCE154CBEBAA60FFC92F67754D4CE@apollo.eu.tieto.com> Message-ID: <2ca799770903182355g55934fc1qed5a258e731a58e9@mail.gmail.com> 2009/3/19 > > > ""Also, if I shut down both nodes and start just one of them, the starting > node still waits in the "starting fencing" part many minutes even though the > cluster should be quorate (there's a quorum disk)! > "" > > I had a similar situation and the reason why the first node couldn't get up > alone was, that cman was starting before qdiskd and so it didn't see quorum > disk votes and was not quorate at the moment. I changed those (boot order) > vice-versa and immediately node boots up ok and is up'n running... > That does make sense. Do you know if there's a reason for cman to start before qdisk or is it just a bug? Any Red Hat developers looking at this thread? Regards Mikko -------------- next part -------------- An HTML attachment was scrubbed... URL: From harri.paivaniemi at tieto.com Thu Mar 19 07:45:31 2009 From: harri.paivaniemi at tieto.com (=?iso-8859-1?q?H=2EP=E4iv=E4niemi?=) Date: Thu, 19 Mar 2009 09:45:31 +0200 Subject: [Linux-cluster] Problems with cluster (fencing?) 
In-Reply-To: <2ca799770903182355g55934fc1qed5a258e731a58e9@mail.gmail.com> References: <2ca799770903180042o445d5f34hf4ec1bd6bbb16b1a@mail.gmail.com> <41E8D4F07FCE154CBEBAA60FFC92F67754D4CE@apollo.eu.tieto.com> <2ca799770903182355g55934fc1qed5a258e731a58e9@mail.gmail.com> Message-ID: <200903190945.32071.harri.paivaniemi@tieto.com> I made a support query to official RH support for that, but they just couldn't understand what I was saying and after all they said "it's normal" ;) To my mind it was a bug... but this was what RH support answered: " The problem with starting a cluster with one node and qdisk is that there needs to be an established cluster before the quorum disk can be added to the cluster. So if you would like to start the cluster as a single node that is a special configuration which has to be considered...and if you would like to drop a node for maintenance then you need to configure the cluster as a 3 node cluster with the third node consisting of the qdisk." -hjp On Thursday 19 March 2009 08:55:57 Mikko Partio wrote: > 2009/3/19 > > > ""Also, if I shut down both nodes and start just one of them, the > > starting node still waits in the "starting fencing" part many minutes > > even though the cluster should be quorate (there's a quorum disk)! > > "" > > > > I had a similar situation and the reason why the first node couldn't get > > up alone was, that cman was starting before qdiskd and so it didn't see > > quorum disk votes and was not quorate at the moment. I changed those > > (boot order) vice-versa and immediately node boots up ok and is up'n > > running... > > That does make sense. Do you know if there's a reason for cman to start > before qdisk or is it just a bug? Any Red Hat developers looking at this > thread? > > Regards > > Mikko From vu at sivell.com Thu Mar 19 13:35:17 2009 From: vu at sivell.com (vu pham) Date: Thu, 19 Mar 2009 07:35:17 -0600 Subject: [Linux-cluster] Problems with cluster (fencing?) In-Reply-To: <2ca799770903182355g55934fc1qed5a258e731a58e9@mail.gmail.com> References: <2ca799770903180042o445d5f34hf4ec1bd6bbb16b1a@mail.gmail.com><41E8D4F07 FCE154CBEBAA60FFC92F67754D4CE@apollo.eu.tieto.com> <2ca799770903182355g55934fc1qed5a258e731a58e9@mail.gmail.com> Message-ID: <49C24A15.5030701@sivell.com> Mikko Partio wrote: > > > 2009/3/19 > > > > ""Also, if I shut down both nodes and start just one of them, the > starting node still waits in the "starting fencing" part many > minutes even though the cluster should be quorate (there's a quorum > disk)! > "" > > I had a similar situation and the reason why the first node couldn't > get up alone was, that cman was starting before qdiskd and so it > didn't see quorum disk votes and was not quorate at the moment. I > changed those (boot order) vice-versa and immediately node boots up > ok and is up'n running... > > > > That does make sense. Do you know if there's a reason for cman to start > before qdisk or is it just a bug? Any Red Hat developers looking at this > thread? > The section *Limitations* in man page of qdisk(5), in my RHEL 5.2, has : " * CMAN must be running before the qdisk program can operate in full capacity. If CMAN is not running, qdisk will wait for it." Vu From carlopmart at gmail.com Thu Mar 19 14:52:37 2009 From: carlopmart at gmail.com (carlopmart) Date: Thu, 19 Mar 2009 15:52:37 +0100 Subject: [Linux-cluster] Re: action status doesn't works with rhel 5.3? 
In-Reply-To: <49C0E7F6.100@gmail.com> References: <49C0E7F6.100@gmail.com> Message-ID: <49C25C35.4080501@gmail.com> carlopmart wrote: > Hi all, > > i have setup a cluster with two nodes and I need to modify when the > service's status is checked. To do this I have put this on cluster.conf: > > recovery="relocate"> > > > > > > But status service is checked every 30 seconds ... What am I doing wrong?? > > Many thanks. > Please, any ideas?? -- CL Martinez carlopmart {at} gmail {d0t} com From Gary_Hunt at gallup.com Thu Mar 19 15:21:39 2009 From: Gary_Hunt at gallup.com (Hunt, Gary) Date: Thu, 19 Mar 2009 10:21:39 -0500 Subject: [Linux-cluster] quorum disk votes In-Reply-To: <49C18E5C.8090307@sivell.com> References: <49C18E5C.8090307@sivell.com> Message-ID: I tried restarting qdiskd and I even tried using luci to remove the node from the cluster and add it back in. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Vu Pham Sent: Wednesday, March 18, 2009 7:14 PM To: linux clustering Subject: Re: [Linux-cluster] quorum disk votes Hunt, Gary wrote: > Is there a way to get a cluster node to recognize that the number of > votes a quorum disk gets has changed? I added a new node to the cluster > and updated the cluster.conf to reflect the changes and propagated it. > In this case I went from 3 total votes and a quorum disk vote of 1 to 5 > total votes and quorum disk votes of 2. > > > > A cman_tool status showed that the total expected votes went to 5, but > the quorum device votes stayed at 1. A reboot of the node fixes this, > but would like to do this without risking disruption > > Have you tried "service qdiskd restart" ? Vu -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster IMPORTANT NOTICE: This e-mail message and all attachments, if any, may contain confidential and privileged material and are intended only for the person or entity to which the message is addressed. If you are not an intended recipient, you are hereby notified that any use, dissemination, distribution, disclosure, or copying of this information is unauthorized and strictly prohibited. If you have received this communication in error, please contact the sender immediately by reply e-mail, and destroy all copies of the original message. From jaime.alonso.miguel at gmail.com Thu Mar 19 17:58:28 2009 From: jaime.alonso.miguel at gmail.com (Jaime Alonso) Date: Thu, 19 Mar 2009 18:58:28 +0100 Subject: [Linux-cluster] fencing error? In-Reply-To: <5f61ab380903171245n4b11231eo428ecd984c925370@mail.gmail.com> References: <68ca69660903170748s189998f3t678295fc6ac504c@mail.gmail.com> <5f61ab380903171245n4b11231eo428ecd984c925370@mail.gmail.com> Message-ID: <68ca69660903191058p724b5a6dpfe65df07ed38ab39@mail.gmail.com> This is my cluster.conf. I started node1 and everything is ok, i started node2 and when is cheking the fencing node2 restarts node1. I can't understand why. thank you 2009/3/17 Nehemias Jahcob > Tu problema es que estas cruzando la ILO, cada ILO debe ser el fence de > cada maquina.. > > Algo asi. > > Cluster.conf.. 
> ######################################################################## > > > > > > > > > > > > > > > > > > > > > > login="rhcs" name="node1_ilo" passwd="rhcs_example"/> > login="rhcs" name="node2_ilo" passwd="rhcs_example"/> > > > > restricted="0"> > priority="1"/> > priority="1"/> > > > > > > > > > > > > ########################################################################## > > > Saludos. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Ed.Sanborn at genband.com Thu Mar 19 20:19:16 2009 From: Ed.Sanborn at genband.com (Ed Sanborn) Date: Thu, 19 Mar 2009 16:19:16 -0400 Subject: [Linux-cluster] (no subject) In-Reply-To: <68ca69660903191058p724b5a6dpfe65df07ed38ab39@mail.gmail.com> References: <68ca69660903170748s189998f3t678295fc6ac504c@mail.gmail.com><5f61ab380903171245n4b11231eo428ecd984c925370@mail.gmail.com> <68ca69660903191058p724b5a6dpfe65df07ed38ab39@mail.gmail.com> Message-ID: <593E210EDC38444DA1C17E9E9F5E264B98FB6A@GBMDMail01.genband.com> I am using a RHEL 5.2 cluster. Using GFS. I am see'ing all 8 of my nodes with the following processes taking a lot of cpu time: aisexec cman_tool (multiple instances per node) Do folks see those two processes hogging the cpu normally? All the time?? Ed -------------- next part -------------- An HTML attachment was scrubbed... URL: From janne.peltonen at helsinki.fi Thu Mar 19 20:41:54 2009 From: janne.peltonen at helsinki.fi (Janne Peltonen) Date: Thu, 19 Mar 2009 22:41:54 +0200 Subject: [Linux-cluster] Rejoin blocked after network failure Message-ID: <20090319204153.GA30554@helsinki.fi> Hi! There was an extensive network failure in our network, which stopped the traffic for a couple minutes in both halves of our heartbeat (or, actually, token) network. After the connection was restored, each node refused to let the other nodes rejoin the cluster because they had 'existing state'. What might be going on? rgmanager 2.0.31-1 cman 2.0.84-2 The relevant syslog portion is very long, so I won't post in on the list. It can be found (for a while) at http://www.helsinki.fi/~jmmpelto/tmp/pcn1-messages-existing-state Thanks. -- Janne Peltonen PGP Key ID: 0x9CFAC88B Please consider membership of the Hospitality Club (http://www.hospitalityclub.org) From vu at sivell.com Fri Mar 20 05:16:42 2009 From: vu at sivell.com (vu pham) Date: Thu, 19 Mar 2009 23:16:42 -0600 Subject: [Linux-cluster] quorum disk votes In-Reply-To: References: <49C18E5C.8090307@sivell.com> Message-ID: <49C326BA.80400@sivell.com> Hunt, Gary wrote: > I tried restarting qdiskd and I even tried using luci to remove the node from the cluster and add it back in. I just tried qdiskd and then rgmanager and it looks like it works. I just tested one time on a particular cluster so I am not sure if any other factor on this cluster can make the difference, or if it is the right way to do. Vu > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Vu Pham > Sent: Wednesday, March 18, 2009 7:14 PM > To: linux clustering > Subject: Re: [Linux-cluster] quorum disk votes > > > Hunt, Gary wrote: >> Is there a way to get a cluster node to recognize that the number of >> votes a quorum disk gets has changed? I added a new node to the cluster >> and updated the cluster.conf to reflect the changes and propagated it. 
>> In this case I went from 3 total votes and a quorum disk vote of 1 to 5 >> total votes and quorum disk votes of 2. >> >> >> >> A cman_tool status showed that the total expected votes went to 5, but >> the quorum device votes stayed at 1. A reboot of the node fixes this, >> but would like to do this without risking disruption >> >> > > Have you tried "service qdiskd restart" ? > > Vu > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > IMPORTANT NOTICE: This e-mail message and all attachments, if any, may contain confidential and privileged material and are intended only for the person or entity to which the message is addressed. If you are not an intended recipient, you are hereby notified that any use, dissemination, distribution, disclosure, or copying of this information is unauthorized and strictly prohibited. If you have received this communication in error, please contact the sender immediately by reply e-mail, and destroy all copies of the original message. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Vu Pham Sivell Corporation 7155 Old Katy Rd. Suite 110 South Houston, TX 77024-2136 voice: 713-821-9800 ext 2203 fax: 713-821-9899 From freebsd_china at 163.com Fri Mar 20 09:37:07 2009 From: freebsd_china at 163.com (Lin Wang) Date: Fri, 20 Mar 2009 17:37:07 +0800 Subject: [Linux-cluster] how to use the vmware fencing Message-ID: <001801c9a93f$702b9cf0$027ba8c0@ibmb6cd09cc0b0> Hello everyone; I am try use the vmware server fencing ,but don't find the standard document .who have the use vmware fencing experience . thank you everybody. WangLin 2009-3-20 -------------- next part -------------- An HTML attachment was scrubbed... URL: From jfriesse at redhat.com Fri Mar 20 09:51:43 2009 From: jfriesse at redhat.com (Jan Friesse) Date: Fri, 20 Mar 2009 10:51:43 +0100 Subject: [Linux-cluster] how to use the vmware fencing In-Reply-To: <001801c9a93f$702b9cf0$027ba8c0@ibmb6cd09cc0b0> References: <001801c9a93f$702b9cf0$027ba8c0@ibmb6cd09cc0b0> Message-ID: <49C3672F.2040108@redhat.com> Lin Wang wrote: > Hello everyone; > I am try use the vmware server fencing ,but don't find the standard document .who have the use vmware fencing experience . > thank you everybody. > WangLin > 2009-3-20 > > > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Lin, vmware fence agent in 5.3 has standard man page (man fence_vmware). I found users have little troubles with configuring this, so I wrote http://sources.redhat.com/cluster/wiki/VMware_FencingConfig . If you still have questions, please ask. Regards, Honza From Santosh.Panigrahi at in.unisys.com Fri Mar 20 10:03:35 2009 From: Santosh.Panigrahi at in.unisys.com (Panigrahi, Santosh Kumar) Date: Fri, 20 Mar 2009 15:33:35 +0530 Subject: [Linux-cluster] how to use the vmware fencing In-Reply-To: <001801c9a93f$702b9cf0$027ba8c0@ibmb6cd09cc0b0> References: <001801c9a93f$702b9cf0$027ba8c0@ibmb6cd09cc0b0> Message-ID: Hello W, I have done fencing in VMware ESX 3.5 environment. Pease refer the old mail thread "[Linux-cluster] help reqd. - VMware ESX 3.5 fencing". After that if you still face any problems then post your queries. 
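As a first step it usually pays to drive the agent by hand, outside the cluster, to confirm the credentials and the VM name. This is only a sketch -- the ESX host, login and guest name are placeholders, and the exact options should be checked against "man fence_vmware" for the build you have installed:

  # query the power state of one guest
  fence_vmware -a esx01.example.com -l fenceuser -p secret -n node1_vm -o status

  # and the action the cluster will actually take when it fences
  fence_vmware -a esx01.example.com -l fenceuser -p secret -n node1_vm -o reboot

Once both of those work from every cluster node, the matching fencedevice and fence entries in cluster.conf usually fall into place.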
Thanks, Santosh From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Lin Wang Sent: Friday, March 20, 2009 3:07 PM To: linux clustering Subject: [Linux-cluster] how to use the vmware fencing Hello everyone; I am try use the vmware server fencing ,but don't find the standard document .who have the use vmware fencing experience . thank you everybody. WangLin 2009-3-20 -------------- next part -------------- An HTML attachment was scrubbed... URL: From frank at si.ct.upc.edu Fri Mar 20 11:20:47 2009 From: frank at si.ct.upc.edu (Frank) Date: Fri, 20 Mar 2009 12:20:47 +0100 Subject: [Linux-cluster] processes stalled reading gfs filesystem Message-ID: <49C37C0F.5020308@si.ct.upc.edu> Hi, we have a couple of Dell servers with Red Hat 5.2 and OpenVZ, sharing a GFS filesystem. We have noticed that there are a directory which processes stalls when try to access it. For instance look this processes: [root at parmenides ~]# ps -fel | grep save 4 D root 8997 1 1 78 0 - 1780 339955 09:40 ? 00:02:31 /usr/sbin/save -s espai.upc.es -g Virtuals -LL -f - -m parmenides -t 1236294005 -l 4 -q -W 78 -N /mnt/gfs /mnt/gfs 0 S root 16736 21208 0 78 0 - 980 pipe_w 12:07 pts/1 00:00:00 grep save 4 D root 18796 1 1 78 0 - 1777 339955 08:46 ? 00:02:16 /usr/sbin/save -s espai.upc.es -g Virtuals -LL -f - -m parmenides -t 1236294005 -l 4 -q -W 78 -N /mnt/gfs /mnt/gfs Both processes are stalled reading a file: # lsof -p 8997 | grep gfs save 8997 root cwd DIR 253,7 2048 7022183 /mnt/gfs/vz/private/109/usr/lib/openoffice/program save 8997 root 3r DIR 253,7 3864 26 /mnt/gfs save 8997 root 6r DIR 253,7 3864 232 /mnt/gfs/vz save 8997 root 7r DIR 253,7 3864 233 /mnt/gfs/vz/private save 8997 root 8r DIR 253,7 3864 230761349 /mnt/gfs/vz/private/109 save 8997 root 9r DIR 253,7 3864 230773154 /mnt/gfs/vz/private/109/usr save 8997 root 12r DIR 253,7 2048 7003944 /mnt/gfs/vz/private/109/usr/lib save 8997 root 14r DIR 253,7 3864 7022175 /mnt/gfs/vz/private/109/usr/lib/openoffice # lsof -p 18796 | grep gfs save 18796 root cwd DIR 253,7 2048 7022183 /mnt/gfs/vz/private/109/usr/lib/openoffice/program save 18796 root 3r DIR 253,7 3864 26 /mnt/gfs save 18796 root 6r DIR 253,7 3864 232 /mnt/gfs/vz save 18796 root 7r DIR 253,7 3864 233 /mnt/gfs/vz/private save 18796 root 8r DIR 253,7 3864 230761349 /mnt/gfs/vz/private/109 save 18796 root 9r DIR 253,7 3864 230773154 /mnt/gfs/vz/private/109/usr save 18796 root 12r DIR 253,7 2048 7003944 /mnt/gfs/vz/private/109/usr/lib save 18796 root 14r DIR 253,7 3864 7022175 /mnt/gfs/vz/private/109/usr/lib/openoffice Also there is a process with the glock_ flag accesing the same: 0 D root 8425 6783 0 78 0 - 669 glock_ 08:24 ? 00:00:00 /usr/lib/openoffice/program/pagein -L/usr/lib/openoffice/program @pagein-common What can be the problem? A corruption in the filesystem? should a "gfs_fsck" fix the problem? Regards. Frank -- Aquest missatge ha estat analitzat per MailScanner a la cerca de virus i d'altres continguts perillosos, i es considera que est? net. 
For all your IT requirements visit: http://www.transtec.co.uk From marcelosoaressouza at gmail.com Fri Mar 20 11:51:26 2009 From: marcelosoaressouza at gmail.com (Marcelo Souza) Date: Fri, 20 Mar 2009 08:51:26 -0300 Subject: [Linux-cluster] MPICH2 1.0.8 x86_64 (amd64) Package for Debian 5.0 (Lenny) Message-ID: <12c9ca330903200451r7357a951ya57b52697828ecc8@mail.gmail.com> http://www.cebacad.net/files/mpich2_1.0.8_amd64.deb http://www.cebacad.net/files/mpich2_1.0.8_amd64.deb.md5 -- Marcelo Soares Souza http://marcelo.cebacad.net From marcelosoaressouza at gmail.com Fri Mar 20 11:53:27 2009 From: marcelosoaressouza at gmail.com (Marcelo Souza) Date: Fri, 20 Mar 2009 08:53:27 -0300 Subject: [Linux-cluster] OpenMPI 1.3.1 x86_64 (amd64) Package for Debian 5.0 (Lenny) Message-ID: <12c9ca330903200453w635f982cr3e2dd4f2ca5bcd12@mail.gmail.com> http://www.cebacad.net/files/openmpi_1.3.1_amd64.deb http://www.cebacad.net/files/openmpi_1.3.1_amd64.deb.md5 -- Marcelo Soares Souza http://marcelo.cebacad.net From mgrac at redhat.com Fri Mar 20 12:44:27 2009 From: mgrac at redhat.com (=?ISO-8859-1?Q?Marek_=27marx=27_Gr=E1c?=) Date: Fri, 20 Mar 2009 13:44:27 +0100 Subject: [Linux-cluster] Fence Device Question In-Reply-To: <6a90e4da0903181122s13fd4eb5ic3e9383d44da4326@mail.gmail.com> References: <6a90e4da0903181122s13fd4eb5ic3e9383d44da4326@mail.gmail.com> Message-ID: <49C38FAB.50600@redhat.com> Hi, Jon Erickson wrote: > All, > > I currently use the fence_mcdata script (with slight mod) to provide > fencing to my DS-4700M switch. > > I have two questions: > > 1. The username and password are stored plain text within the > cluster.conf file. Is there a way to make this more secure? > (password script?) > > fence_mcdata has: -S Script to run to retrieve login password > 2. fence_mcdata works by making a telnet connection to my switch, > this is also plain text. I know the switch can support SSH. Does > anyone have any expirence using SSH to log into a switch to block > ports? Is there a fence_mcdata_ssh script :). > There is not and I do not have access to this device, but I can rewrite old perl code to our fencing library + python, where ssh will work. But I will need you to test it. Please let me know off-list if you are interested. m, From tiagocruz at forumgdh.net Fri Mar 20 14:45:27 2009 From: tiagocruz at forumgdh.net (Tiago Cruz) Date: Fri, 20 Mar 2009 11:45:27 -0300 Subject: [Linux-cluster] Cluster RHEL under ESX Message-ID: <1237560327.15983.37.camel@tuxkiller> Hello Guys, I have a problem in my environment using RHEL 5.2 machines under a VMware ESX 3.5 - using fence_vmware_ng from Honza. The cluster works as well, the fence works very good too, but, some times we lost the quorum: Mar 19 16:35:21 csd-poa-cla-02 clurgmgrd[2837]: #1: Quorum Dissolved Mar 19 16:35:21 csd-poa-cla-02 openais[2669]: [CLM ] got nodejoin message 10.11.3.50 Mar 19 16:35:21 csd-poa-cla-02 ccsd[2630]: Cluster is not quorate. Refusing connection And we need to reboot all machines to back work. We try to isolate the multicast communication on another interface (eth1) connected on another virtual switch, using a dedicated VLAN, but the problem still happening sometimes, random... Can you suggest some tip for me? Thanks! 
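In case more data is useful, this is what I plan to capture on both nodes the next time the quorum drops (just a sketch of my intention, nothing collected yet):

  cman_tool status     # expected votes / quorum as cman sees it
  cman_tool nodes      # membership from each node's point of view
  clustat              # what rgmanager thinks of the services
  group_tool ls        # state of the fence/dlm/gfs groups

plus a short tcpdump of the multicast traffic on eth1 while the problem is happening, to see whether the virtual switch is dropping it.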
-- Tiago Cruz http://everlinux.com From Ed.Sanborn at genband.com Fri Mar 20 18:29:26 2009 From: Ed.Sanborn at genband.com (Ed Sanborn) Date: Fri, 20 Mar 2009 14:29:26 -0400 Subject: [Linux-cluster] gfs_controld, aisexec and cman_tool In-Reply-To: <593E210EDC38444DA1C17E9E9F5E264B98FB20@GBMDMail01.genband.com> References: <593E210EDC38444DA1C17E9E9F5E264B98FB1F@GBMDMail01.genband.com> <593E210EDC38444DA1C17E9E9F5E264B98FB20@GBMDMail01.genband.com> Message-ID: <593E210EDC38444DA1C17E9E9F5E264B98FB8E@GBMDMail01.genband.com> Hi folks, I have an 8-node cluster running on an IBM Bladecenter HS21. Using RHEL 5.2, GFS (no GFS2). The nodes are exhibiting high-cpu load with the following apps: aisexec and cman_tool Both these apps race the cpu without any other user apps doing much at all. Affectively, the user experience is dog-slow. After I reboot one of the nodes it clears up, these apps (aisexec and cman_tool)\ seem to behave, for awhile. Eventually they race the cpu again days to weeks later. Has anyone ever experienced this? Top output is below. Thanks, Ed [root at blade1]# top top - 13:47:51 up 40 days, 22:16, 37 users, load average: 4.17, 3.94, 3.86 Tasks: 372 total, 2 running, 369 sleeping, 1 stopped, 0 zombie Cpu(s): 5.9%us, 32.6%sy, 0.0%ni, 61.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 8311372k total, 1934844k used, 6376528k free, 76332k buffers Swap: 8388600k total, 322976k used, 8065624k free, 443172k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 4352 root RT 0 37404 35m 2020 R 100 0.4 10519:34 aisexec 20806 root 16 0 1684 560 484 S 42 0.0 8324:49 cman_tool 12501 root 15 0 1680 556 484 S 31 0.0 609:38.46 cman_tool 27245 root 16 0 1688 560 484 S 30 0.0 508:14.31 cman_tool 4635 root 34 19 0 0 0 S 2 0.0 1271:52 kipmi0 5047 root 18 0 405m 17m 6260 S 1 0.2 21:57.04 cimserver 28975 root 15 0 2564 1296 900 R 1 0.0 0:00.05 top 1 root 15 0 2064 576 524 S 0 0.0 0:02.91 init 2 root RT -5 0 0 0 S 0 0.0 0:02.98 migration/0 3 root 34 19 0 0 0 S 0 0.0 0:00.11 ksoftirqd/0 4 root RT -5 0 0 0 S 0 0.0 0:00.00 watchdog/0 5 root RT -5 0 0 0 S 0 0.0 0:01.29 migration/1 -------------- next part -------------- An HTML attachment was scrubbed... URL: From corey.kovacs at gmail.com Sat Mar 21 09:16:59 2009 From: corey.kovacs at gmail.com (Corey Kovacs) Date: Sat, 21 Mar 2009 09:16:59 +0000 Subject: [Linux-cluster] gfs_controld, aisexec and cman_tool In-Reply-To: <593E210EDC38444DA1C17E9E9F5E264B98FB8E@GBMDMail01.genband.com> References: <593E210EDC38444DA1C17E9E9F5E264B98FB1F@GBMDMail01.genband.com> <593E210EDC38444DA1C17E9E9F5E264B98FB20@GBMDMail01.genband.com> <593E210EDC38444DA1C17E9E9F5E264B98FB8E@GBMDMail01.genband.com> Message-ID: <4F24E84E-E4C7-4C26-9B6F-B55BF809FF26@gmail.com> Do you have ntp setup? It's possible for the cluster to form without it if the clocks are close enough, but after some skew sets in the cluster deamons work harder to keep in sync. Regards, Corey On Mar 20, 2009, at 18:29, "Ed Sanborn" wrote: > Hi folks, > > > > I have an 8-node cluster running on an IBM Bladecenter HS21. Using > RHEL 5.2, GFS (no GFS2). > > The nodes are exhibiting high-cpu load with the following apps: > > > > aisexec and cman_tool > > > > Both these apps race the cpu without any other user apps doing much > at all. > > Affectively, the user experience is dog-slow. > > After I reboot one of the nodes it clears up, these apps (aisexec > and cman_tool)\ > > seem to behave, for awhile. Eventually they race the cpu again days > to weeks later. > > Has anyone ever experienced this? Top output is below. 
> > > > Thanks, > > > > Ed > > > > > > [root at blade1]# top > > top - 13:47:51 up 40 days, 22:16, 37 users, load average: 4.17, > 3.94, 3.86 > > Tasks: 372 total, 2 running, 369 sleeping, 1 stopped, 0 zombie > > Cpu(s): 5.9%us, 32.6%sy, 0.0%ni, 61.4%id, 0.0%wa, 0.0%hi, > 0.0%si, 0.0%st > > Mem: 8311372k total, 1934844k used, 6376528k free, 76332k > buffers > > Swap: 8388600k total, 322976k used, 8065624k free, 443172k > cached > > > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > > 4352 root RT 0 37404 35m 2020 R 100 0.4 10519:34 aisexec > > 20806 root 16 0 1684 560 484 S 42 0.0 8324:49 cman_tool > > 12501 root 15 0 1680 556 484 S 31 0.0 609:38.46 cman_tool > > 27245 root 16 0 1688 560 484 S 30 0.0 508:14.31 cman_tool > > 4635 root 34 19 0 0 0 S 2 0.0 1271:52 kipmi0 > > 5047 root 18 0 405m 17m 6260 S 1 0.2 21:57.04 cimserver > > 28975 root 15 0 2564 1296 900 R 1 0.0 0:00.05 top > > 1 root 15 0 2064 576 524 S 0 0.0 0:02.91 init > > 2 root RT -5 0 0 0 S 0 0.0 0:02.98 > migration/0 > > 3 root 34 19 0 0 0 S 0 0.0 0:00.11 > ksoftirqd/0 > > 4 root RT -5 0 0 0 S 0 0.0 0:00.00 > watchdog/0 > > 5 root RT -5 0 0 0 S 0 0.0 0:01.29 > migration/1 > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From theophanis_kontogiannis at yahoo.gr Sun Mar 22 21:10:37 2009 From: theophanis_kontogiannis at yahoo.gr (Theophanis Kontogiannis) Date: Sun, 22 Mar 2009 23:10:37 +0200 Subject: [Linux-cluster] Can not create a GFS2 filesystem with block size of 512 Message-ID: <029601c9ab32$a7354350$f59fc9f0$@gr> Hello all, I have: Two nodes cluster with exactly the same hardware (4GB RAM, AMD X2, 1TB SATAII disk, 320GB PATA disk) Centos 5.2 2.6.18-92.1.22.el5.centos.plus gfs2-utils-0.1.44-1.el5_2.1 kmod-gfs-0.1.23-5.el5_2.4 I try to create GFS2 on 1TB LV (on the SATA disk) with 'mkfs.gfs2 -b 512 -t tweety:gfs2-01 -p lock_dlm -j 10 /dev/mapper/vg1-data1' I want to use 512 block size. Since the command looked that it last forever, I used -D option to look on what is going on. This visualized the fact that nothing happens after initial effort to create Journal 3. Because I got those freezes during mkfs.gfs2, I used 'vmstat -p /dev/sda1' on a second terminal to observe disk activity. This confirmed that there is no disk activity after the initial effort to create Journal 3. I thought that it might be a problem of how I created the LV so I erase all VG, LV and PV and try to create the file system on the physical device. The creation of GFS2 on the physical partition also freezes on Journal 3. This is the extract of the output of 'mkfs.gfs2 -D -b 512 -t tweety:gfs2-01 -p lock_dlm -j 10 /dev/sda1' (which is same to the output I get when using the LV instead of the physical partition): .......... 
ri_addr: 1951423390 ri_length: 269 ri_data0: 1951423659 ri_data: 523884 ri_bitbytes: 130971 ri_addr: 1951947543 ri_length: 269 ri_data0: 1951947812 ri_data: 523884 ri_bitbytes: 130971 ri_addr: 1952471696 ri_length: 269 ri_data0: 1952471965 ri_data: 523884 ri_bitbytes: 130971 ri_addr: 1952995849 ri_length: 269 ri_data0: 1952996118 ri_data: 523884 ri_bitbytes: 130971 Root directory: mh_magic: 0x01161970 mh_type: 4 mh_format: 400 no_formal_ino: 1 no_addr: 399 di_mode: 040755 di_uid: 0 di_gid: 0 di_nlink: 2 di_size: 280 di_blocks: 1 di_atime: 1237752111 di_mtime: 1237752111 di_ctime: 1237752111 di_major: 0 di_minor: 0 di_goal_meta: 399 di_goal_data: 399 di_flags: 0x00000001 di_payload_format: 1200 di_height: 0 di_depth: 0 di_entries: 2 di_eattr: 0 Master dir: mh_magic: 0x01161970 mh_type: 4 mh_format: 400 no_formal_ino: 2 no_addr: 400 di_mode: 040755 di_uid: 0 di_gid: 0 di_nlink: 2 di_size: 280 di_blocks: 1 di_atime: 1237752111 di_mtime: 1237752111 di_ctime: 1237752111 di_major: 0 di_minor: 0 di_goal_meta: 400 di_goal_data: 400 di_flags: 0x00000201 di_payload_format: 1200 di_height: 0 di_depth: 0 di_entries: 2 di_eattr: 0 Super Block: mh_magic: 0x01161970 mh_type: 1 mh_format: 100 sb_fs_format: 1801 sb_multihost_format: 1900 sb_bsize: 512 sb_bsize_shift: 9 no_formal_ino: 2 no_addr: 400 no_formal_ino: 1 no_addr: 399 sb_lockproto: lock_dlm sb_locktable: tweety:gfs2-01 Journal 0: mh_magic: 0x01161970 mh_type: 4 mh_format: 400 no_formal_ino: 4 no_addr: 402 di_mode: 0100600 di_uid: 0 di_gid: 0 di_nlink: 1 di_size: 134217728 di_blocks: 266516 di_atime: 1237752111 di_mtime: 1237752111 di_ctime: 1237752111 di_major: 0 di_minor: 0 di_goal_meta: 4773 di_goal_data: 266917 di_flags: 0x00000200 di_payload_format: 0 di_height: 4 di_depth: 0 di_entries: 0 di_eattr: 0 Journal 1: mh_magic: 0x01161970 mh_type: 4 mh_format: 400 no_formal_ino: 5 no_addr: 266918 di_mode: 0100600 di_uid: 0 di_gid: 0 di_nlink: 1 di_size: 134217728 di_blocks: 266516 di_atime: 1237752111 di_mtime: 1237752111 di_ctime: 1237752111 di_major: 0 di_minor: 0 di_goal_meta: 271289 di_goal_data: 533703 di_flags: 0x00000200 di_payload_format: 0 di_height: 4 di_depth: 0 di_entries: 0 di_eattr: 0 Journal 2: mh_magic: 0x01161970 mh_type: 4 mh_format: 400 no_formal_ino: 6 no_addr: 533704 di_mode: 0100600 di_uid: 0 di_gid: 0 di_nlink: 1 di_size: 134217728 di_blocks: 266516 di_atime: 1237752111 di_mtime: 1237752111 di_ctime: 1237752111 di_major: 0 di_minor: 0 di_goal_meta: 538075 di_goal_data: 800219 di_flags: 0x00000200 di_payload_format: 0 di_height: 4 di_depth: 0 di_entries: 0 di_eattr: 0 Journal 3: After that I have no disk activity and no logging. There is no message from the kernel. Anyone knows the reason for that behavior, and what is the minimum block size I can use (I have tested 1024 and it works fine)? Thank you all for your time. Theophanis Kontogiannis -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeff.sturm at eprize.com Mon Mar 23 04:31:35 2009 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Mon, 23 Mar 2009 00:31:35 -0400 Subject: [Linux-cluster] Can not create a GFS2 filesystem with block size of512 In-Reply-To: <029601c9ab32$a7354350$f59fc9f0$@gr> References: <029601c9ab32$a7354350$f59fc9f0$@gr> Message-ID: <64D0546C5EBBD147B75DE133D798665F021BA594@hugo.eprize.local> Don't use GFS2 on CentOS 5.2. It simply isn't ready. You'll want to setup either GFS1 instead, or wait for CentOS 5.3. 
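If you go the GFS1 route, the rough equivalent of the command you were running would look something like this -- a sketch only, reusing your cluster name, journal count and device, and whether a 512-byte block size is really what you want is a separate question (gfs_mkfs also accepts -b, with 4096 as the usual default):

  gfs_mkfs -p lock_dlm -t tweety:gfs-01 -b 512 -j 10 /dev/mapper/vg1-data1

Mounting then works much the same as with gfs2, just with filesystem type gfs.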
________________________________ From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Theophanis Kontogiannis Sent: Sunday, March 22, 2009 5:11 PM To: 'linux clustering' Subject: [Linux-cluster] Can not create a GFS2 filesystem with block size of512 Hello all, I have: Two nodes cluster with exactly the same hardware (4GB RAM, AMD X2, 1TB SATAII disk, 320GB PATA disk) Centos 5.2 2.6.18-92.1.22.el5.centos.plus gfs2-utils-0.1.44-1.el5_2.1 kmod-gfs-0.1.23-5.el5_2.4 I try to create GFS2 on 1TB LV (on the SATA disk) with 'mkfs.gfs2 -b 512 -t tweety:gfs2-01 -p lock_dlm -j 10 /dev/mapper/vg1-data1' I want to use 512 block size. Since the command looked that it last forever, I used -D option to look on what is going on. This visualized the fact that nothing happens after initial effort to create Journal 3. Because I got those freezes during mkfs.gfs2, I used 'vmstat -p /dev/sda1' on a second terminal to observe disk activity. This confirmed that there is no disk activity after the initial effort to create Journal 3. I thought that it might be a problem of how I created the LV so I erase all VG, LV and PV and try to create the file system on the physical device. The creation of GFS2 on the physical partition also freezes on Journal 3. This is the extract of the output of 'mkfs.gfs2 -D -b 512 -t tweety:gfs2-01 -p lock_dlm -j 10 /dev/sda1' (which is same to the output I get when using the LV instead of the physical partition): ........................... ri_addr: 1951423390 ri_length: 269 ri_data0: 1951423659 ri_data: 523884 ri_bitbytes: 130971 ri_addr: 1951947543 ri_length: 269 ri_data0: 1951947812 ri_data: 523884 ri_bitbytes: 130971 ri_addr: 1952471696 ri_length: 269 ri_data0: 1952471965 ri_data: 523884 ri_bitbytes: 130971 ri_addr: 1952995849 ri_length: 269 ri_data0: 1952996118 ri_data: 523884 ri_bitbytes: 130971 Root directory: mh_magic: 0x01161970 mh_type: 4 mh_format: 400 no_formal_ino: 1 no_addr: 399 di_mode: 040755 di_uid: 0 di_gid: 0 di_nlink: 2 di_size: 280 di_blocks: 1 di_atime: 1237752111 di_mtime: 1237752111 di_ctime: 1237752111 di_major: 0 di_minor: 0 di_goal_meta: 399 di_goal_data: 399 di_flags: 0x00000001 di_payload_format: 1200 di_height: 0 di_depth: 0 di_entries: 2 di_eattr: 0 Master dir: mh_magic: 0x01161970 mh_type: 4 mh_format: 400 no_formal_ino: 2 no_addr: 400 di_mode: 040755 di_uid: 0 di_gid: 0 di_nlink: 2 di_size: 280 di_blocks: 1 di_atime: 1237752111 di_mtime: 1237752111 di_ctime: 1237752111 di_major: 0 di_minor: 0 di_goal_meta: 400 di_goal_data: 400 di_flags: 0x00000201 di_payload_format: 1200 di_height: 0 di_depth: 0 di_entries: 2 di_eattr: 0 Super Block: mh_magic: 0x01161970 mh_type: 1 mh_format: 100 sb_fs_format: 1801 sb_multihost_format: 1900 sb_bsize: 512 sb_bsize_shift: 9 no_formal_ino: 2 no_addr: 400 no_formal_ino: 1 no_addr: 399 sb_lockproto: lock_dlm sb_locktable: tweety:gfs2-01 Journal 0: mh_magic: 0x01161970 mh_type: 4 mh_format: 400 no_formal_ino: 4 no_addr: 402 di_mode: 0100600 di_uid: 0 di_gid: 0 di_nlink: 1 di_size: 134217728 di_blocks: 266516 di_atime: 1237752111 di_mtime: 1237752111 di_ctime: 1237752111 di_major: 0 di_minor: 0 di_goal_meta: 4773 di_goal_data: 266917 di_flags: 0x00000200 di_payload_format: 0 di_height: 4 di_depth: 0 di_entries: 0 di_eattr: 0 Journal 1: mh_magic: 0x01161970 mh_type: 4 mh_format: 400 no_formal_ino: 5 no_addr: 266918 di_mode: 0100600 di_uid: 0 di_gid: 0 di_nlink: 1 di_size: 134217728 di_blocks: 266516 di_atime: 1237752111 di_mtime: 1237752111 di_ctime: 1237752111 di_major: 0 di_minor: 
0 di_goal_meta: 271289 di_goal_data: 533703 di_flags: 0x00000200 di_payload_format: 0 di_height: 4 di_depth: 0 di_entries: 0 di_eattr: 0 Journal 2: mh_magic: 0x01161970 mh_type: 4 mh_format: 400 no_formal_ino: 6 no_addr: 533704 di_mode: 0100600 di_uid: 0 di_gid: 0 di_nlink: 1 di_size: 134217728 di_blocks: 266516 di_atime: 1237752111 di_mtime: 1237752111 di_ctime: 1237752111 di_major: 0 di_minor: 0 di_goal_meta: 538075 di_goal_data: 800219 di_flags: 0x00000200 di_payload_format: 0 di_height: 4 di_depth: 0 di_entries: 0 di_eattr: 0 Journal 3: After that I have no disk activity and no logging. There is no message from the kernel. Anyone knows the reason for that behavior, and what is the minimum block size I can use (I have tested 1024 and it works fine)? Thank you all for your time. Theophanis Kontogiannis -------------- next part -------------- An HTML attachment was scrubbed... URL: From robejrm at gmail.com Mon Mar 23 10:55:33 2009 From: robejrm at gmail.com (Juan Ramon Martin Blanco) Date: Mon, 23 Mar 2009 11:55:33 +0100 Subject: [Linux-cluster] Weird machine load average when using GFS2 Message-ID: <8a5668960903230355s5e058e2fn11fe1e1c41725f38@mail.gmail.com> Hi all, I'm having a strange issue with a machine included in a two-node cluster. I have several gfs2 filesystems that reside on top of clustered logical volumes in a DAS shared by the two nodes. In machine 2, each time I mount a gfs2 fs, load is incremented by 1, and decremented if I umount it. This machine is actually not very loaded, load should be around 0.1 Machine 1 is not affected, load does not change when I mount/umount [root at machine2 ~]# gfs2_tool list 253:15 MYCLUSTER:gfs201 253:14 MYCLUSTER:gfs202 253:39 MYCLUSTER:gfs203 253:38 MYCLUSTER:gfs204 253:49 MYCLUSTER:gfs205 253:33 MYCLUSTER:gfs206 253:28 MYCLUSTER:gfs207 253:24 MYCLUSTER:gfs208 253:55 MYCLUSTER:gfs209 [root at machine2 ~]# w 11:29:51 up 1 day, 23:39, 1 user, load average: 9,00, 9,00, 8,63 USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT root pts/0 x.x.x.x 11:19 0.00s 0.02s 0.00s w If I umount two of the filesystems: [root at machine2 ~]# umount /my/mount/point/05 [root at machine2 ~]# umount /my/mount/point/06 [root at machine2 ~]# gfs2_tool list 253:15 MYCLUSTER:gfs201 253:14 MYCLUSTER:gfs202 253:39 MYCLUSTER:gfs203 253:38 MYCLUSTER:gfs204 253:28 MYCLUSTER:gfs207 253:24 MYCLUSTER:gfs208 253:55 MYCLUSTER:gfs209 Load is an average, so waiting a few minutes: [root at machine2 ~]# w 11:36:07 up 1 day, 23:45, 1 user, load average: 7,35, 8,20, 8,41 USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT root pts/0 x.x.x.x 11:19 0.00s 0.04s 0.04s -bash And machine one is not having the same behaviour: [root at machine1 ~]# w 11:36:45 up 1 day, 23:37, 9 users, load average: 0,01, 0,04, 0,14 And it has mounted more filesystems: [root at machine1 ~]# gfs2_tool list | wc -l 26 Both machines are rhel5.3 with the latest updates applied: Linux machine1 2.6.18-128.1.1.el5 #1 SMP Mon Jan 26 13:58:24 EST 2009 x86_64 x86_64 x86_64 GNU/Linux Linux machine2 2.6.18-128.1.1.el5 #1 SMP Mon Jan 26 13:58:24 EST 2009 x86_64 x86_64 x86_64 GNU/Linux cman-2.0.98-1.el5 lvm2-cluster-2.02.40-7.el5 gfs2-utils-0.1.53-1.el5 Any clue? Greetings, Juan Ram?n Mart?n -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Kaerka.Phillips at USPTO.GOV Mon Mar 23 15:57:37 2009 From: Kaerka.Phillips at USPTO.GOV (Phillips, Kaerka) Date: Mon, 23 Mar 2009 11:57:37 -0400 Subject: [Linux-cluster] Recommended fence timeouts Message-ID: Hi- I was wondering best practices for a 4-node cluster using Dell R905's and the fence_drac function. The server which is controlling the cluster keeps attempting to fence too fast and hanging the drac5 cards (amongst other issues). Each R905 has quad-quad core AMD Opterons and 128gb of ram. This is the default which I'm using now: And the openais config: Thanks, Kaerka Phillips Unix Systems Services Section USPTO - 3C21 571-272-6443 -------------- next part -------------- An HTML attachment was scrubbed... URL: From fdinitto at redhat.com Tue Mar 24 09:39:03 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Tue, 24 Mar 2009 10:39:03 +0100 Subject: [Linux-cluster] Cluster 3.0.0.rc1 release Message-ID: <1237887543.23013.57.camel@cerberus.int.fabbione.net> he cluster team and its community are proud to announce the 3.0.0.rc1 release candidate from the STABLE3 branch. The development cycle for 3.0.0 is completed. The STABLE3 branch is now collecting only bug fixes and minimal update required to build and run on top of the latest upstream kernel/corosync/openais. Everybody with test equipment and time to spare, is highly encouraged to download, install and test this release candidate and more important report problems. This is the time for people to make a difference and help us testing as much as possible. In order to build the 3.0.0.rc1 release you will need: - corosync 0.95 - openais 0.94 - linux kernel 2.6.28.x (requires the latest release from the 2.6.28.x stable release for gfs-kernel module) The new source tarball can be downloaded here: ftp://sources.redhat.com/pub/cluster/releases/cluster-3.0.0.rc1.tar.gz https://fedorahosted.org/releases/c/l/cluster/cluster-3.0.0.rc1.tar.gz At the same location is now possible to find separated tarballs for fence-agents and resource-agents as previously announced (http://www.redhat.com/archives/cluster-devel/2009-February/msg00003.htm) To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Happy clustering, Fabio Under the hood (from 3.0.0.beta1): Abhijith Das (1): gfs2_tool, libgfs2: bz487608 - GFS2: gfs2_tool unfreeze hangs Andrew Price (4): gfs2-utils: Remove 'die' calls from mount_gfs2_meta libgfs2: Remove 'die' calls from __get_sysfs libgfs2: Clean up mp2fsname2 libgfs2: Remove 'die' from mp2fsname and find_debugfs_mount Christine Caulfield (4): config: make confdb2ldap compile with latest corosync config: config_xml doesn't need to include logging.h cman: Fix crash caused by cman_tool version -r0 cman: Tidy up cman sockets when we exit David Teigland (11): fence_tool: dump only the size of the buffer gfs_controld: dlm_controld failure is cluster_dead gfs_controld: wait for dlm registration dlm_controld/gfs_controld: default to no plock rate limit dlm_tool: fix lockdebug parsing of first_lkid libdlm: fix warning fenced: fix accounting for startup fencing victims groupd/fenced/dlm_controld/gfs_controld: fix lockfile fence_tool, init.d/cman: fix wait/retry options dlm_tool: show waiters in lockdebug output groupd/fenced/dlm_controld/gfs_controld: unlink lockfile Fabio M. 
Di Nitto (15): config: time to say goodbye to ccsd config: fix build warnings introduced by corosync changes cman: fix build warnings introduced by corosync changes build: drop mibdir as it's now unrequired cman: fix more uint -> hdb_handle_t breakage build: drop xenlibs info notifyd: fix lockfile config: stop linking against compat libs rgmanager: stop linking clulog with cman init: merge qdisk init script into cman init cman init: drop legacy option init: cmannotifyd failure should not exit notifyd: fix lockfile handling and exit path init: fix status check config: kill dead code Jan Friesse (7): fence_intelmodular: Rewrite of agent under Python unified library fence_ibmblade: Rewrite of agent under Python unified library fence_apc_snmp: Rewrite of agent under Python unified library fence_ipmilan: Added list, monitor and metadata operations fence_ipmilan: Fix metadata generation fence_snmp_*: Add missed sys.path.apend to SNMP agents fence: Fix configure script (hardcoded fenceagentslibdir path) Lon Hohberger (15): rgmanager: Fix enable-while-frozen breakage rgmanager: Fix ip start phase with monitor_link="0" fence: Make fence agent metadata valid XML config: Make tag optional these days. config: Include 'port' as cman option config: Can't actually disable_openais from cluster.conf rgmanager: Fix VM restart issue rgmanager: fix rare segfault in -USR1 dump code rgmanager: Block signals in worker threads rgmanager: Clean up thread handling rgmanager: Fix build issue fence: Fix fence_xvm[d] man pages qdisk: Make mixed-endian work qdiskd: Remove pid file on clean/normal exit rgmanager: Remove pid file on clean/normal exit Marek 'marx' Grac (3): fence_agents: #487501 - Exceptions in fencing agents fence_agents: #487501 - Exceptions in fencing agents fence_apc: #491640 - APC Fence Agent does not work with non-admin account Steven Whitehouse (1): Remove unused code from various places cman/daemon/cman-preconfig.c | 26 +- cman/daemon/commands.c | 7 +- cman/daemon/nodelist.h | 10 +- cman/init.d/Makefile | 19 +- cman/init.d/cman.in | 174 ++-- cman/init.d/qdiskd.in | 123 -- cman/notifyd/main.c | 16 +- cman/qdisk/daemon_init.c | 16 +- cman/qdisk/disk.c | 45 +- cman/qdisk/main.c | 16 +- cman/qdisk/mkqdisk.c | 7 +- cman/qdisk/proc.c | 7 +- config/Makefile | 2 +- config/daemons/Makefile | 8 - config/daemons/ccsd/Makefile | 37 - config/daemons/ccsd/ccsd.c | 905 --------------- config/daemons/ccsd/cluster_mgr.c | 688 ----------- config/daemons/ccsd/cluster_mgr.h | 6 - config/daemons/ccsd/cnx_mgr.c | 1399 ----------------------- config/daemons/ccsd/cnx_mgr.h | 7 - config/daemons/ccsd/comm_headers.h | 48 - config/daemons/ccsd/debug.h | 9 - config/daemons/ccsd/globals.c | 19 - config/daemons/ccsd/globals.h | 23 - config/daemons/ccsd/misc.c | 388 ------- config/daemons/ccsd/misc.h | 19 - config/daemons/man/Makefile | 9 - config/daemons/man/ccsd.8 | 74 -- config/libs/Makefile | 3 - config/libs/libccscompat/Makefile | 15 - config/libs/libccscompat/libccscompat.c | 752 ------------ config/libs/libccscompat/libccscompat.h | 18 - config/libs/libccsconfdb/ccs.h | 4 - config/libs/libccsconfdb/libccs.c | 71 -- config/man/Makefile | 3 +- config/man/ccs.7 | 22 - config/plugins/Makefile | 3 - config/plugins/ccsais/Makefile | 33 - config/plugins/ccsais/config.c | 236 ---- config/plugins/ldap/configldap.c | 4 +- config/plugins/xml/config.c | 14 +- config/tools/ccs_tool/Makefile | 18 +- config/tools/ccs_tool/ccs_tool.c | 57 - config/tools/ccs_tool/editconf.c | 14 - config/tools/ccs_tool/update.c | 673 ----------- 
config/tools/ccs_tool/update.h | 6 - config/tools/ldap/confdb2ldif.c | 33 +- config/tools/man/Makefile | 4 - config/tools/man/ccs_test.8 | 132 --- config/tools/xml/cluster.rng | 63 +- configure | 25 +- dlm/libdlm/libdlm.c | 8 +- dlm/tool/main.c | 142 +++- fence/agents/apc/fence_apc.py | 33 +- fence/agents/apc_snmp/Makefile | 1 - fence/agents/apc_snmp/fence_apc_snmp.py | 625 +++-------- fence/agents/bladecenter/fence_bladecenter.py | 5 +- fence/agents/cisco_mds/fence_cisco_mds.py | 1 + fence/agents/ibmblade/fence_ibmblade.pl | 273 ----- fence/agents/ibmblade/fence_ibmblade.py | 76 ++ fence/agents/ifmib/fence_ifmib.py | 1 + fence/agents/ilo/fence_ilo.py | 2 + fence/agents/intelmodular/Makefile | 4 + fence/agents/intelmodular/fence_intelmodular.py | 87 ++ fence/agents/ipmilan/ipmilan.c | 79 ++- fence/agents/lib/fencing.py.py | 7 +- fence/agents/lpar/fence_lpar.py | 12 +- fence/agents/xvm/options.c | 3 + fence/fence_tool/fence_tool.c | 337 ++++--- fence/fenced/cpg.c | 2 +- fence/fenced/fd.h | 1 + fence/fenced/main.c | 3 +- fence/fenced/recover.c | 5 + fence/man/Makefile | 4 +- fence/man/fence_apc_snmp.8 | 139 +++ fence/man/fence_ibmblade.8 | 111 ++- fence/man/fence_intelmodular.8 | 131 +++ fence/man/fence_xvm.8 | 24 +- fence/man/fence_xvmd.8 | 32 +- gfs2/libgfs2/libgfs2.h | 5 +- gfs2/libgfs2/misc.c | 151 ++- gfs2/mkfs/main.c | 9 +- gfs2/mkfs/main_grow.c | 50 +- gfs2/mkfs/main_jadd.c | 139 ++-- gfs2/mkfs/main_mkfs.c | 157 ++-- gfs2/quota/check.c | 20 +- gfs2/quota/main.c | 34 +- gfs2/tool/df.c | 55 +- gfs2/tool/misc.c | 29 +- gfs2/tool/tune.c | 20 +- group/daemon/main.c | 3 +- group/dlm_controld/config.h | 2 +- group/dlm_controld/main.c | 3 +- group/gfs_controld/config.h | 2 +- group/gfs_controld/cpg-new.c | 15 +- group/gfs_controld/gfs_daemon.h | 1 + group/gfs_controld/main.c | 12 +- make/defines.mk.input | 3 - make/fencebuild.mk | 1 - make/install.mk | 4 - make/uninstall.mk | 3 - rgmanager/include/res-ocf.h | 1 + rgmanager/include/reslist.h | 1 + rgmanager/src/clulib/daemon_init.c | 19 +- rgmanager/src/clulib/tmgr.c | 61 +- rgmanager/src/daemons/groups.c | 3 + rgmanager/src/daemons/main.c | 2 + rgmanager/src/daemons/restree.c | 28 +- rgmanager/src/daemons/rg_state.c | 4 +- rgmanager/src/daemons/rg_thread.c | 2 +- rgmanager/src/resources/ip.sh | 35 +- rgmanager/src/utils/Makefile | 2 +- 112 files changed, 2023 insertions(+), 7311 deletions(-) -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From carlopmart at gmail.com Tue Mar 24 15:24:27 2009 From: carlopmart at gmail.com (carlopmart) Date: Tue, 24 Mar 2009 16:24:27 +0100 Subject: [Linux-cluster] Change clustername on a gfs2 filesystem Message-ID: <49C8FB2B.9020201@gmail.com> Hi all, Somebody knows how can I chage clustername used when disk or volume is formatted?? I have changed clustername on cluster.conf for three nodes and I need to mount old gfs2 filesystem with the new clustername. Many thanks. 
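I am guessing it is something along these lines, run with the filesystem unmounted on every node (only a guess on my part -- the device and names below are made up, and I have not dared to try it on the real volume yet):

  gfs2_tool sb /dev/myvg/mylv table newclustername:myfsname

where newclustername must match the new name in cluster.conf. Is that the supported way to do it, or is there a better one?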
-- CL Martinez carlopmart {at} gmail {d0t} com From jumanjiman at gmail.com Tue Mar 24 15:27:18 2009 From: jumanjiman at gmail.com (jumanjiman at gmail.com) Date: Tue, 24 Mar 2009 15:27:18 +0000 Subject: [Linux-cluster] Change clustername on a gfs2 filesystem In-Reply-To: <49C8FB2B.9020201@gmail.com> References: <49C8FB2B.9020201@gmail.com> Message-ID: <1620312237-1237908448-cardhu_decombobulator_blackberry.rim.net-1348359012-@bxe1288.bisx.prod.on.blackberry> Use 'gfs2_tool sb' to edit the table name in the superblock? -paul Sent via BlackBerry by AT&T -----Original Message----- From: carlopmart Date: Tue, 24 Mar 2009 16:24:27 To: Subject: [Linux-cluster] Change clustername on a gfs2 filesystem Hi all, Somebody knows how can I chage clustername used when disk or volume is formatted?? I have changed clustername on cluster.conf for three nodes and I need to mount old gfs2 filesystem with the new clustername. Many thanks. -- CL Martinez carlopmart {at} gmail {d0t} com -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From alan.zg at gmail.com Tue Mar 24 20:24:55 2009 From: alan.zg at gmail.com (Alan A) Date: Tue, 24 Mar 2009 15:24:55 -0500 Subject: [Linux-cluster] fence_apc works while fence_node does not Message-ID: I opened RHEL suppport ticket - I am just wondering if anyone knows what is the status of this BUG/fix - ticket 1896886. We use APC power switches for fencing - when fence_node call is made this is the output: agent "fence_apc" reports: Traceback (most recent call last): File "/sbin/fence_apc", line 207, in ? main() File "/sbin/fence_apc", line 191, in main fence_action(conn, options, set_power_status, get_power_status) File "/usr/lib/fence/fencing.py", line 355, in fence_a agent "fence_apc" reports: ction status = get_power_fn(tn, options) File "/sbin/fence_apc", line 82, in get_power_status status = re.compile("\s*"+options["-n"]+"-.*(ON|OFF)", re.IGNORECASE).search(result).group(1) AttributeError: 'NoneType' object has no attribute 'group agent "fence_apc" reports: ' -- Alan A. -------------- next part -------------- An HTML attachment was scrubbed... URL: From frank at si.ct.upc.edu Wed Mar 25 14:14:14 2009 From: frank at si.ct.upc.edu (Frank) Date: Wed, 25 Mar 2009 15:14:14 +0100 (CET) Subject: [Linux-cluster] Re: processes stalled reading gfs filesystem In-Reply-To: <20090320124527.277D961A120@hormel.redhat.com> References: <20090320124527.277D961A120@hormel.redhat.com> Message-ID: <46576.147.83.83.59.1237990454.squirrel@imap-ct.upc.es> Hi again, I haven't received any answer but I keep on giving details about this issue. Finally I umount GFS filesystem in both nodes and I have done a gfs_fsck;it have fix several filesystem elements. After I have mount it, and when we try to work on previously damaged directories we get those messages: GFS: fsid=hr-pm:gfs01.0: warning: assertion "(gh->gh_flags & LM_FLAG_ANY) || !(tmp_gh->gh_flags & LM_FLAG_ANY)" failed GFS: fsid=hr-pm:gfs01.0: function = add_to_queue GFS: fsid=hr-pm:gfs01.0: file = fs/gfs/glock.c, line = 1420 GFS: fsid=hr-pm:gfs01.0: time = 1237984594 BUG: warning at fs/gfs/util.c:287/gfs_assert_warn_i() (Tainted: P ) [] gfs_assert_warn_i+0x92/0xbd [gfs] [] gfs_glock_nq+0x131/0x36f [gfs] [] gfs_glock_nq_init+0x13/0x26 [gfs] [] gfs_private_nopage+0x45/0x81 [gfs] [] __handle_mm_fault+0x23b/0xe08 [] __do_page_cache_readahead+0x1ab/0x1cc [] do_page_fault+0x2a4/0x5ad [] do_page_fault+0x0/0x5ad [] error_code+0x4f/0x54 [] __inet6_check_established+0x21f/0x394 Any ideas? 
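If it would help someone to look deeper, I can gather more detail the next time a process gets stuck on that directory -- roughly the following (only a sketch of what I intend to collect, suggestions welcome):

  gfs_tool counters /mnt/gfs                        # glock / lock statistics for the mount
  gfs_tool lockdump /mnt/gfs > /tmp/gfs-lockdump    # dump of the glocks around the stuck inode
  echo t > /proc/sysrq-trigger                      # kernel stacks of the D-state processes into the log

Is there anything else worth capturing before I try another gfs_fsck pass?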
Thanks. Frank > Date: Fri, 20 Mar 2009 12:20:47 +0100 > From: Frank > Subject: [Linux-cluster] processes stalled reading gfs filesystem > To: linux-cluster at redhat.com > Message-ID: <49C37C0F.5020308 at si.ct.upc.edu> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > Hi, > we have a couple of Dell servers with Red Hat 5.2 and OpenVZ, sharing a > GFS filesystem. > > We have noticed that there are a directory which processes stalls when > try to access it. > For instance look this processes: > > [root at parmenides ~]# ps -fel | grep save > 4 D root 8997 1 1 78 0 - 1780 339955 09:40 ? > 00:02:31 /usr/sbin/save -s espai.upc.es -g Virtuals -LL -f - -m > parmenides -t 1236294005 -l 4 -q -W 78 -N /mnt/gfs /mnt/gfs > 0 S root 16736 21208 0 78 0 - 980 pipe_w 12:07 pts/1 > 00:00:00 grep save > 4 D root 18796 1 1 78 0 - 1777 339955 08:46 ? > 00:02:16 /usr/sbin/save -s espai.upc.es -g Virtuals -LL -f - -m > parmenides -t 1236294005 -l 4 -q -W 78 -N /mnt/gfs /mnt/gfs > > Both processes are stalled reading a file: > > # lsof -p 8997 | grep gfs > save 8997 root cwd DIR 253,7 2048 7022183 > /mnt/gfs/vz/private/109/usr/lib/openoffice/program > save 8997 root 3r DIR 253,7 3864 26 /mnt/gfs > save 8997 root 6r DIR 253,7 3864 232 /mnt/gfs/vz > save 8997 root 7r DIR 253,7 3864 233 > /mnt/gfs/vz/private > save 8997 root 8r DIR 253,7 3864 230761349 > /mnt/gfs/vz/private/109 > save 8997 root 9r DIR 253,7 3864 230773154 > /mnt/gfs/vz/private/109/usr > save 8997 root 12r DIR 253,7 2048 7003944 > /mnt/gfs/vz/private/109/usr/lib > save 8997 root 14r DIR 253,7 3864 7022175 > /mnt/gfs/vz/private/109/usr/lib/openoffice > > # lsof -p 18796 | grep gfs > save 18796 root cwd DIR 253,7 2048 7022183 > /mnt/gfs/vz/private/109/usr/lib/openoffice/program > save 18796 root 3r DIR 253,7 3864 26 /mnt/gfs > save 18796 root 6r DIR 253,7 3864 232 /mnt/gfs/vz > save 18796 root 7r DIR 253,7 3864 233 > /mnt/gfs/vz/private > save 18796 root 8r DIR 253,7 3864 230761349 > /mnt/gfs/vz/private/109 > save 18796 root 9r DIR 253,7 3864 230773154 > /mnt/gfs/vz/private/109/usr > save 18796 root 12r DIR 253,7 2048 7003944 > /mnt/gfs/vz/private/109/usr/lib > save 18796 root 14r DIR 253,7 3864 7022175 > /mnt/gfs/vz/private/109/usr/lib/openoffice > > Also there is a process with the glock_ flag accesing the same: > > 0 D root 8425 6783 0 78 0 - 669 glock_ 08:24 ? > 00:00:00 /usr/lib/openoffice/program/pagein > -L/usr/lib/openoffice/program @pagein-common > > What can be the problem? A corruption in the filesystem? > should a "gfs_fsck" fix the problem? > Regards. > > Frank -- Aquest missatge ha estat analitzat per MailScanner a la cerca de virus i d'altres continguts perillosos, i es considera que est? net. For all your IT requirements visit: http://www.transtec.co.uk From haprapp at gmail.com Wed Mar 25 23:20:26 2009 From: haprapp at gmail.com (Hari Prakash) Date: Wed, 25 Mar 2009 23:20:26 +0000 Subject: [Linux-cluster] Hari's Calendar Message-ID: Hi Please click on the link below and enter your birthday for me. I am creating a birthday calendar for myself. Don't worry, it'll take less than a minute (and you don't have to enter your year of birth). 
http://www.birthdayalarm.com/bd2/84884185a223481172b1462676858c440280226d1386 Hari From fernando at lozano.eti.br Thu Mar 26 19:07:21 2009 From: fernando at lozano.eti.br (fernando at lozano.eti.br) Date: Thu, 26 Mar 2009 16:07:21 -0300 Subject: [Linux-cluster] rhcs x iptables Message-ID: <49cbd269.a2.1b3e.193823380@lozano.eti.br> Hi there, I have a Fedora 10 system with two KVM virtual machines, both running RHEL 5.2 and RHCS. The intent is to prototype a cluster configuration for a customer. The problem is, everything is fine unless I start iptables on the VMs. But it's unacceptable to run the cluster without am OS-level firewall. The ports list on rhcs manuals, on the cluster project wiki, and what I observe using netstat do not agree. None of them talks about port 5149 which I observe being opened by aisexec (cman). And I don't see any use of ports 41966 through 41968 which are supposed to be opened my rgmanager or 5404 by cman. But even after I changed my iptables config to open all ports, I still canot relocate or failover services between nodes. I configured apache as a script service to play with cluster administration. My vms are on the default KVM network, 192.168.122./24. It's very strange system-config-cluster on node 1 shows both nodes (cs1 and cs2) joined the cluster and starts my teste-httpd service, but node 2 doesn't show the status of any cluster service (on system-config-cluster). If I try to use clusvnadm to relocate the service from cs1 to cs2, it hangs. And I can't stop rgmanager with iptables enabled. Flushing iptables doesn't help when cman and rgmanager were started with iptables on. Attached are my cluster.conf, /etc/sysconfig/iptables and netstat -anp []s, Fernando Lozano -------------- next part -------------- A non-text attachment was scrubbed... Name: iptables Type: application/octet-stream Size: 2019 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: cluster.conf Type: application/octet-stream Size: 1191 bytes Desc: not available URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: netstat-tudo.txt URL: From jds at techma.com Thu Mar 26 22:00:19 2009 From: jds at techma.com (Simmons, Dan A) Date: Thu, 26 Mar 2009 18:00:19 -0400 Subject: [Linux-cluster] rhel4u7 gfs locking up - unable to obtain cluster lock Message-ID: <35440D75A88EF04C81FDEC03F4F739F7028A28@tmaemail.techma.com> Hi All, I have a production Redhat 4u7 GFS cluster that has locked up 5 times in the last week. The cluster consists of 12 nodes. 3 of the nodes run Oracle RAC and the rest run home grown applications. The system has heavy read/write to the shared gfs disks. The symptoms seem similar to those described in bugzilla 247766 -- my cluster locks up and I am unable to do anything except reboot the entire cluster. Prior to the system locking up I get an error in /var/log/messages "unable to obtain cluster lock: connection timed out" on one of the nodes but nothing else appears in the logs. There are 4 gfs volumes. 
The current stats from the busiest volume are: locks 68763 locks held 33981 incore inodes 33778 metadata buffers 210 unlinked inodes 0 quota IDs 5 incore log buffers 0 log space used 0.34% meta header cache entries 0 glock dependencies 0 glocks on reclaim list 0 log wraps 17 outstanding LM calls 0 outstanding BIO calls 0 fh2dentry misses 0 glocks reclaimed 41083300 glock nq calls 39290298 glock dq calls 26025821 glock prefetch calls 34071947 lm_lock calls 54069538 lm_unlock calls 40805646 lm callbacks 94932089 address operations 2335588 dentry operations 4683578 export operations 0 file operations 3179652 inode operations 9595976 super operations 39907494 vm operations 0 block I/O reads 34785108 block I/O writes 344510 I would be grateful for any advice, especially regarding locks and tuning. I am tempted to set the glock_purge to 50 as described as a fix for the RHEL4u4 locking problem but worry that this might screw things up worse. The specifics for the system are as follows: Rhel4u7 smp kernel 2.6.9-78.0.1 gfs 6.1.18-1 gfs-kernel-smp-2-6-9-80.9 rgmanager 1.9.80-1 cman 1.0.24-1 ccs 1.0.12-1 magma 1.0.8-1 magma-plugin 1.0.14-1 lvm2-cluster 2.02.37-3 fence 1.32.63-1 J. Dan From kadlec at mail.kfki.hu Thu Mar 26 22:47:00 2009 From: kadlec at mail.kfki.hu (Kadlecsik Jozsef) Date: Thu, 26 Mar 2009 23:47:00 +0100 (CET) Subject: [Linux-cluster] Freeze with cluster-2.03.11 Message-ID: Hi, Freshly built cluster-2.03.11 reproducibly freezes as mailman started. The versions are: linux-2.6.27.21 cluster-2.03.11 openais from svn, subrev 1152 version 0.80 LVM2.2.02.44 This is a five node cluster wich was just upgraded from cluster-2.01.00, node by node. All nodes went fine except when the last one, which runs the mailman queue manager was upgraded: after the upgrade as the manager is started, the system freezes completely. No error message in the screen or in the kernel log. The system responds to ping, that's all, but nothing can be done at the console except rebooting. Usually when this node is fenced off, shortly after the fencing node freezes as well. What I could find in the kernel log of this second machine is as follows: Mar 26 23:09:24 lxserv1 kernel: dlm: closing connection to node 1 Mar 26 23:09:25 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: Trying to acquire journal lock... Mar 26 23:09:25 lxserv1 kernel: GFS: fsid=kfki:services.1: jid=3: Trying to acquire journal lock... Mar 26 23:09:25 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: Looking at journal... Mar 26 23:09:25 lxserv1 kernel: GFS: fsid=kfki:services.1: jid=3: Looking at journal... Mar 26 23:09:25 lxserv1 kernel: GFS: fsid=kfki:services.1: jid=3: Acquiring the transaction lock... Mar 26 23:09:25 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: Acquiring the transaction lock... Mar 26 23:09:26 lxserv1 kernel: GFS: fsid=kfki:services.1: jid=3: Replaying journal... Mar 26 23:09:26 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: Replaying journal... 
Mar 26 23:09:26 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: Replayed 65 of 85 blocks Mar 26 23:09:26 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: replays = 65, skips = 12, sames = 8 Mar 26 23:09:26 lxserv1 kernel: GFS: fsid=kfki:services.1: jid=3: Replayed 888 of 994 blocks Mar 26 23:09:26 lxserv1 kernel: GFS: fsid=kfki:services.1: jid=3: replays = 888, skips = 66, sames = 40 Mar 26 23:09:26 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: Journal replayed in 1s Mar 26 23:09:26 lxserv1 kernel: GFS: fsid=kfki:services.1: jid=3: Done Does it indicate anything, which could help to fix the cluster? Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 49, Hungary From kadlec at mail.kfki.hu Fri Mar 27 01:03:18 2009 From: kadlec at mail.kfki.hu (Kadlecsik Jozsef) Date: Fri, 27 Mar 2009 02:03:18 +0100 (CET) Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: References: Message-ID: On Thu, 26 Mar 2009, Kadlecsik Jozsef wrote: > Freshly built cluster-2.03.11 reproducibly freezes as mailman started. [...] Of course all the mailman data is over GFS: list config files, locks, queues, archives, etc. When the system is frozen, nothing can be obtained by the magic sysreq keys, just the command is echoed (if the screen was not blank before, otherwise it remains the same). Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 49, Hungary From ben.yarwood at juno.co.uk Fri Mar 27 02:21:01 2009 From: ben.yarwood at juno.co.uk (Ben Yarwood) Date: Fri, 27 Mar 2009 02:21:01 -0000 Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: References: Message-ID: <096401c9ae82$ac21d650$046582f0$@yarwood@juno.co.uk> Replaying a journal as below usually idicates a node has withdrawn from that file system I believe. You should grep messages on all nodes for 'GFS', if any node is repoting errors with this fs then it will need rebooting/fencing before access to that fs can be achieved. Ben -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Kadlecsik Jozsef Sent: 26 March 2009 22:47 To: linux clustering Subject: [Linux-cluster] Freeze with cluster-2.03.11 Hi, Freshly built cluster-2.03.11 reproducibly freezes as mailman started. The versions are: linux-2.6.27.21 cluster-2.03.11 openais from svn, subrev 1152 version 0.80 LVM2.2.02.44 This is a five node cluster wich was just upgraded from cluster-2.01.00, node by node. All nodes went fine except when the last one, which runs the mailman queue manager was upgraded: after the upgrade as the manager is started, the system freezes completely. No error message in the screen or in the kernel log. The system responds to ping, that's all, but nothing can be done at the console except rebooting. Usually when this node is fenced off, shortly after the fencing node freezes as well. What I could find in the kernel log of this second machine is as follows: Mar 26 23:09:24 lxserv1 kernel: dlm: closing connection to node 1 Mar 26 23:09:25 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: Trying to acquire journal lock... Mar 26 23:09:25 lxserv1 kernel: GFS: fsid=kfki:services.1: jid=3: Trying to acquire journal lock... 
Mar 26 23:09:25 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: Looking at journal... Mar 26 23:09:25 lxserv1 kernel: GFS: fsid=kfki:services.1: jid=3: Looking at journal... Mar 26 23:09:25 lxserv1 kernel: GFS: fsid=kfki:services.1: jid=3: Acquiring the transaction lock... Mar 26 23:09:25 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: Acquiring the transaction lock... Mar 26 23:09:26 lxserv1 kernel: GFS: fsid=kfki:services.1: jid=3: Replaying journal... Mar 26 23:09:26 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: Replaying journal... Mar 26 23:09:26 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: Replayed 65 of 85 blocks Mar 26 23:09:26 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: replays = 65, skips = 12, sames = 8 Mar 26 23:09:26 lxserv1 kernel: GFS: fsid=kfki:services.1: jid=3: Replayed 888 of 994 blocks Mar 26 23:09:26 lxserv1 kernel: GFS: fsid=kfki:services.1: jid=3: replays = 888, skips = 66, sames = 40 Mar 26 23:09:26 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: Journal replayed in 1s Mar 26 23:09:26 lxserv1 kernel: GFS: fsid=kfki:services.1: jid=3: Done Does it indicate anything that could help to fix the cluster? Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 49, Hungary -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From kadlec at mail.kfki.hu Fri Mar 27 06:47:06 2009 From: kadlec at mail.kfki.hu (Kadlecsik Jozsef) Date: Fri, 27 Mar 2009 07:47:06 +0100 (CET) Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: <096401c9ae82$ac21d650$046582f0$@yarwood@juno.co.uk> References: <096401c9ae82$ac21d650$046582f0$@yarwood@juno.co.uk> Message-ID: On Fri, 27 Mar 2009, Ben Yarwood wrote: > Replaying a journal as below usually indicates a node has withdrawn from that > file system, I believe. You should grep the messages on all nodes for 'GFS'; if > any node is reporting errors with this fs then it will need rebooting/fencing > before access to that fs can be regained. The failing node is fenced off. Here are the steps to reproduce the freeze of the node: - all nodes are running and are members of the cluster - start the mailman queue manager: the node freezes - the frozen node is fenced off by a member of the cluster - I can see log messages as I wrote in my first mail: Mar 26 23:09:24 lxserv1 kernel: dlm: closing connection to node 1 Mar 26 23:09:25 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: Trying to acquire journal lock... [...] - sometimes (but not always) the fencing machine freezes as well and is therefore fenced off too - a third node has never frozen so far and the cluster has thus remained in quorum - the fenced-off machines are restarted, join the cluster and work until I start the mailman queue manager The daily backups of the whole GFS file systems complete fine, so I assume it's not filesystem corruption. Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB.
49, Hungary From kadlec at mail.kfki.hu Fri Mar 27 09:17:51 2009 From: kadlec at mail.kfki.hu (Kadlecsik Jozsef) Date: Fri, 27 Mar 2009 10:17:51 +0100 (CET) Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: References: <096401c9ae82$ac21d650$046582f0$@yarwood@juno.co.uk> Message-ID: On Fri, 27 Mar 2009, Kadlecsik Jozsef wrote: > The failining node is fenced off. Here are the steps to reproduce the > freeze of the node: > > - all nodes are running and member of the cluster > - start the mailman queue manager: the node freezes > - the freezed node fenced off by a member of the cluster > - I can see log messages as I wrote in my first mail: > > Mar 26 23:09:24 lxserv1 kernel: dlm: closing connection to node 1 > Mar 26 23:09:25 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: Trying to > acquire journal lock... > [...] > > - sometimes (but not always) the fencing machine freezes as well > and then therefore fenced off > - third node has never freezed so far and the cluster thus remained > in quorum > - fenced off machines restarted, join the cluster and work until I start > the mailman queue manager > > The daily backups of the whole GFS file systems are completed, so I assume > it's not a filesystem corruption. Trying to start mailman on another node causes instant freezing on that node as well, so it's node-independent. When the node freezes, the kernel log shows at the other nodes: Mar 27 08:58:38 lxserv1 kernel: dlm: closing connection to node 3 Mar 27 08:58:38 lxserv1 kernel: GFS: fsid=kfki:home.3: jid=0: Trying to acquire journal lock... Mar 27 08:58:38 lxserv1 kernel: GFS: fsid=kfki:services.3: jid=0: Trying to acquire journal lock... Mar 27 08:58:39 lxserv1 kernel: GFS: fsid=kfki:home.3: jid=0: Busy Mar 27 08:58:39 lxserv1 kernel: GFS: fsid=kfki:services.3: jid=0: Busy except one node where the journal replayed: Mar 27 08:58:38 lxserv0 kernel: dlm: closing connection to node 3 Mar 27 08:58:38 lxserv0 kernel: GFS: fsid=kfki:services.1: jid=0: Trying to acquire journal lock... Mar 27 08:58:38 lxserv0 kernel: GFS: fsid=kfki:home.1: jid=0: Trying to acquire journal lock... Mar 27 08:58:39 lxserv0 kernel: GFS: fsid=kfki:home.1: jid=0: Busy Mar 27 08:58:39 lxserv0 kernel: GFS: fsid=kfki:services.1: jid=0: Looking at journal... Mar 27 08:58:39 lxserv0 kernel: GFS: fsid=kfki:services.1: jid=0: Acquiring the transaction lock... Mar 27 08:58:39 lxserv0 kernel: GFS: fsid=kfki:services.1: jid=0: Replaying journal... Mar 27 08:58:39 lxserv0 kernel: GFS: fsid=kfki:services.1: jid=0: Replayed 342 of 380 blocks Mar 27 08:58:39 lxserv0 kernel: GFS: fsid=kfki:services.1: jid=0: replays = 342, skips = 23, sames = 15 Mar 27 08:58:39 lxserv0 kernel: GFS: fsid=kfki:services.1: jid=0: Journal replayed in 1s Mar 27 08:58:39 lxserv0 kernel: GFS: fsid=kfki:services.1: jid=0: Done The cman debug log shows three different patterns (it is unfortunate that openais does not timestamp the debug log), like these node2: +[TOTEM] entering GATHER state from 12. +[TOTEM] entering GATHER state from 0. +[TOTEM] Saving state aru 284b high seq received 284b +[TOTEM] Storing new sequence id for ring 3064 +[TOTEM] entering COMMIT state. +[TOTEM] entering RECOVERY state. 
+[TOTEM] position [0] member 192.168.192.6: +[TOTEM] previous ring seq 12384 rep 192.168.192.6 +[TOTEM] aru 284b high delivered 284b received flag 1 +[TOTEM] position [1] member 192.168.192.7: +[TOTEM] previous ring seq 12384 rep 192.168.192.6 +[TOTEM] aru 284b high delivered 284b received flag 1 +[TOTEM] position [2] member 192.168.192.15: +[TOTEM] previous ring seq 12384 rep 192.168.192.6 +[TOTEM] aru 284b high delivered 284b received flag 1 +[TOTEM] position [3] member 192.168.192.18: +[TOTEM] previous ring seq 12384 rep 192.168.192.6 +[TOTEM] aru 284b high delivered 284b received flag 1 +[TOTEM] Did not need to originate any messages in recovery. +[CLM ] CLM CONFIGURATION CHANGE +[CLM ] New Configuration: +[CLM ] r(0) ip(192.168.192.6) +[CLM ] r(0) ip(192.168.192.7) +[CLM ] r(0) ip(192.168.192.15) +[CLM ] r(0) ip(192.168.192.18) +[CLM ] Members Left: +[CLM ] r(0) ip(192.168.192.17) node4: +[TOTEM] entering GATHER state from 12. +[TOTEM] entering GATHER state from 0. +[TOTEM] Creating commit token because I am the rep. +[TOTEM] Saving state aru 284b high seq received 284b +[TOTEM] Storing new sequence id for ring 3064 +[TOTEM] entering COMMIT state. +[TOTEM] entering RECOVERY state. +[TOTEM] position [0] member 192.168.192.6: +[TOTEM] previous ring seq 12384 rep 192.168.192.6 +[TOTEM] aru 284b high delivered 284b received flag 1 +[TOTEM] position [1] member 192.168.192.7: +[TOTEM] previous ring seq 12384 rep 192.168.192.6 +[TOTEM] aru 284b high delivered 284b received flag 1 +[TOTEM] position [2] member 192.168.192.15: +[TOTEM] previous ring seq 12384 rep 192.168.192.6 +[TOTEM] aru 284b high delivered 284b received flag 1 +[TOTEM] position [3] member 192.168.192.18: +[TOTEM] previous ring seq 12384 rep 192.168.192.6 +[TOTEM] aru 284b high delivered 284b received flag 1 +[TOTEM] Did not need to originate any messages in recovery. +[TOTEM] Sending initial ORF token +[CLM ] CLM CONFIGURATION CHANGE +[CLM ] New Configuration: +[CLM ] r(0) ip(192.168.192.6) +[CLM ] r(0) ip(192.168.192.7) +[CLM ] r(0) ip(192.168.192.15) +[CLM ] r(0) ip(192.168.192.18) +[CLM ] Members Left: +[CLM ] r(0) ip(192.168.192.17) node5: +[TOTEM] The token was lost in the OPERATIONAL state. +[TOTEM] Receive multicast socket recv buffer size (288000 bytes). +[TOTEM] Transmit multicast socket send buffer size (288000 bytes). +[TOTEM] entering GATHER state from 2. +[TOTEM] entering GATHER state from 0. +[TOTEM] Saving state aru 284b high seq received 284b +[TOTEM] Storing new sequence id for ring 3064 +[TOTEM] entering COMMIT state. +[TOTEM] entering RECOVERY state. +[TOTEM] position [0] member 192.168.192.6: +[TOTEM] previous ring seq 12384 rep 192.168.192.6 +[TOTEM] aru 284b high delivered 284b received flag 1 +[TOTEM] position [1] member 192.168.192.7: +[TOTEM] previous ring seq 12384 rep 192.168.192.6 +[TOTEM] aru 284b high delivered 284b received flag 1 +[TOTEM] position [2] member 192.168.192.15: +[TOTEM] previous ring seq 12384 rep 192.168.192.6 +[TOTEM] aru 284b high delivered 284b received flag 1 +[TOTEM] position [3] member 192.168.192.18: +[TOTEM] previous ring seq 12384 rep 192.168.192.6 +[TOTEM] aru 284b high delivered 284b received flag 1 +[TOTEM] Did not need to originate any messages in recovery. 
+[CLM ] CLM CONFIGURATION CHANGE +[CLM ] New Configuration: +[CLM ] r(0) ip(192.168.192.6) +[CLM ] r(0) ip(192.168.192.7) +[CLM ] r(0) ip(192.168.192.15) +[CLM ] r(0) ip(192.168.192.18) +[CLM ] Members Left: +[CLM ] r(0) ip(192.168.192.17) On the victim machine I started # strace -f -o strace.log /etc/init.d/mailman start and after the freeze hit Alt+SysRq+S, but that does not guarantee that the whole log file was synced. Nevertheless, it seems mailman went through all the mailing list config files and entered the listening state before the freeze (/var/lib/mailman/locks is a symbolic link pointing to a GFS directory): [...] 14215 write(4, "/var/lib/mailman/locks/master-qr"..., 50) = 50 14215 close(4) = 0 14215 munmap(0xb7bdb000, 4096) = 0 14215 umask(022) = 02 14215 gettimeofday({1238140699, 692207}, NULL) = 0 14215 utimes("/var/lib/mailman/locks/master-qrunner.web0.14215.1", {1238248699, 692207}) = 0 14215 link("/var/lib/mailman/locks/master-qrunner.web0.14215.1", "/var/lib/mailman/locks/master-qrunner") = -1 EEXIST (File exists) 14215 stat64("/var/lib/mailman/locks/master-qrunner", {st_mode=S_IFREG|0664, st_size=50, ...}) = 0 14215 open("/var/lib/mailman/locks/master-qrunner", O_RDONLY|O_LARGEFILE) = 4 14215 fstat64(4, {st_mode=S_IFREG|0664, st_size=50, ...}) = 0 14215 fstat64(4, {st_mode=S_IFREG|0664, st_size=50, ...}) = 0 14215 _llseek(4, 0, [0], SEEK_CUR) = 0 14215 fstat64(4, {st_mode=S_IFREG|0664, st_size=50, ...}) = 0 14215 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7bdb000 14215 _llseek(4, 0, [0], SEEK_CUR) = 0 14215 read(4, "/var/lib/mailman/locks/master-qr"..., 4096) = 50 14215 read(4, "", 4096) = 0 14215 close(4) = 0 14215 munmap(0xb7bdb000, 4096) = 0 14215 gettimeofday({1238140699, 693608}, NULL) = 0 14215 gettimeofday({1238140699, 693644}, NULL) = 0 14215 stat64("/var/lib/mailman/locks/master-qrunner", {st_mode=S_IFREG|0664, st_size=50, ...}) = 0 14215 select(0, NULL, NULL, NULL, {0, 668025} I ran 'gfs_tool counters /gfs/services' in a one-second sleep loop on all nodes, but I could not discover anything interesting there. The last entry on the failed node is: Fri Mar 27 08:58:18 CET 2009 locks 4630 locks held 1308 freeze count 0 incore inodes 269 metadata buffers 4 unlinked inodes 0 quota IDs 0 incore log buffers 0 log space used 0.15% meta header cache entries 23 glock dependencies 1 glocks on reclaim list 0 log wraps 1 outstanding LM calls 0 outstanding BIO calls 0 fh2dentry misses 0 glocks reclaimed 1315287 glock nq calls 25160827 glock dq calls 25160295 glock prefetch calls 1201 lm_lock calls 340274 lm_unlock calls 335838 lm callbacks 677487 address operations 680976 dentry operations 9891829 export operations 0 file operations 843043 inode operations 13932689 super operations 746420 vm operations 338069 block I/O reads 0 block I/O writes 0 In an attempt to trigger the freeze without mailman (in case it is due to a corrupt fs) I ran find . -type f | while read x; do echo $x; cat $x > /dev/null; done in the mailman root directory, but it produced nothing on the node where I ran the command; however, another node froze at that time. I'd be glad of any suggestion to solve the node freeze. Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB.
49, Hungary From kadlec at mail.kfki.hu Fri Mar 27 12:27:10 2009 From: kadlec at mail.kfki.hu (Kadlecsik Jozsef) Date: Fri, 27 Mar 2009 13:27:10 +0100 (CET) Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: References: <096401c9ae82$ac21d650$046582f0$@yarwood@juno.co.uk> Message-ID: On Fri, 27 Mar 2009, Kadlecsik Jozsef wrote: > In an attempt to trigger the freeze without mailman (if it is due to > a corrupt fs) I umounted the GFS filesystems on all nodes and ran fsck on all of them, just in case. Some unused inodes, unlinked inodes and bitmap differences were fixed. After bringing up everything, within half an hour one node got frozen again, without starting/running mailman :-(. Sigh. The pressure is mounting to fix the cluster at any cost, and nothing remained but to downgrade to cluster-2.01.00/openais-0.80.3 which would be just ridiculous. Anything else we could do to stabilize the cluster nodes? Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 49, Hungary From nick at javacat.f2s.com Fri Mar 27 15:11:43 2009 From: nick at javacat.f2s.com (Nick Lunt) Date: Fri, 27 Mar 2009 15:11:43 -0000 Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: References: <096401c9ae82$ac21d650$046582f0$@yarwood@juno.co.uk> Message-ID: <005401c9aeee$56f954c0$04ebfe40$@f2s.com> I have to suggest purchasing solaris, hp-ux or AIX for running enterprise clusters. Having been using Linux for over 10 years it is still not ready for production clustering. The lack of decent documentation and the number of cluster software updates that shaft your production systems is a joke. I often wish I'd never learnt Linux and stuck with solaris instead. Just my 2 cents worth but if you need to run a production server do not use Linux Cluster GFS unless you like your boss giving you grief. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Kadlecsik Jozsef Sent: 27 March 2009 12:27 To: linux clustering Subject: RE: [Linux-cluster] Freeze with cluster-2.03.11 On Fri, 27 Mar 2009, Kadlecsik Jozsef wrote: > In an attempt to trigger the freeze without mailman (if it is due to > a corrupt fs) I umounted the GFS filesystems on all nodes and ran fsck on all of them, just in case. Some unused inodes, unlinked inodes and bitmap differences were fixed. After bringing up everything, within half an hour one node got frozen again, without starting/running mailman :-(. Sigh. The pressure is mounting to fix the cluster at any cost, and nothing remained but to downgrade to cluster-2.01.00/openais-0.80.3 which would be just ridiculous. Anything else we could do to stabilize the cluster nodes? Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB.
49, Hungary -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From kadlec at mail.kfki.hu Fri Mar 27 15:20:30 2009 From: kadlec at mail.kfki.hu (Kadlecsik Jozsef) Date: Fri, 27 Mar 2009 16:20:30 +0100 (CET) Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: <005401c9aeee$56f954c0$04ebfe40$@f2s.com> References: <096401c9ae82$ac21d650$046582f0$@yarwood@juno.co.uk> <005401c9aeee$56f954c0$04ebfe40$@f2s.com> Message-ID: On Fri, 27 Mar 2009, Nick Lunt wrote: > I have to suggest purchasing solaris, hp-ux or AIX for running enterprise > clusters. Thanks, but currently - and in the foreseeable future - that's not an option for us. And at the moment it would be impractical, as the existing cluster must be fixed. Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 49, Hungary From rhurst at bidmc.harvard.edu Fri Mar 27 16:04:57 2009 From: rhurst at bidmc.harvard.edu (Robert Hurst) Date: Fri, 27 Mar 2009 12:04:57 -0400 Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: <005401c9aeee$56f954c0$04ebfe40$@f2s.com> References: <096401c9ae82$ac21d650$046582f0$@yarwood@juno.co.uk> <005401c9aeee$56f954c0$04ebfe40$@f2s.com> Message-ID: <1238169898.4486.26.camel@WSBID06223.bidmc.harvard.edu> I thought this list was for technical feedback that addresses the issue at hand; however, since one opinion was offered, I will express one of my own. On Fri, 2009-03-27 at 15:11 +0000, Nick Lunt wrote: > I have to suggest purchasing solaris, hp-ux or AIX for running enterprise > clusters. Hmmm... I have had a lot of success on those platforms, too, and you would have to expect that from a tightly integrated OS from the OEM ... including the extra "0" at the end of the price tag. Unfortunately, the extra "0" did not pay back as well as planned, because we incurred downtime anyway -- from "unexpected" cluster failures -- with "unreasonable" patch cycles for obscure break-fix measures that also introduced regression issues. Clearly, sophisticated system architectures such as these require careful testing and expertise from their operations personnel to manage their complexity. > Having been using Linux for over 10 years it is still not ready for > production clustering. The lack of decent documentation and the number of > cluster software updates that shaft your production systems is a joke. I > often wish I'd never learnt Linux and stuck with solaris instead. Yeah, well that depends on what you mean by decent documentation. In my experience, comparing what was available in November 2005, when I first trialed RHEL 4.2 with GFS 6.1, to what was released this past July in 4.7, I would have to say its documentation is plentiful. And anything that is "missing" has clearly been filled in through access to this list. The written materials aside, I have learned more by tuning in here to other customers' site implementations, and then considering the differences with my own. Our RHEL sales engineers and RHEL premium support have been helpful in our implementations and support of them after going into production. Also, in my experience, RHEL training is a top-shelf service that should not be ignored by those acclimated to turn-key solutions.
> Just my 2 cents worth but if you need to run a production server do not use > Linux Cluster GFS unless you like your boss giving you grief. I can respect that statement! My bosses have learned by my education and example the need to tolerate the complexities with any cluster architecture ... and it becomes more palatable to them when you are also saving their operations budget by hundreds of thousands of dollars per year -- with an architecture that easily performs same, or better, than the fading commercial OSes. ________________________________________________________________________ Robert Hurst, Sr. Cach? Administrator Beth Israel Deaconess Medical Center 1135 Tremont Street, REN-7 Boston, Massachusetts 02120-2140 617-754-8754 ? Fax: 617-754-8730 ? Cell: 401-787-3154 Any technology distinguishable from magic is insufficiently advanced. -------------- next part -------------- An HTML attachment was scrubbed... URL: From kadlec at mail.kfki.hu Fri Mar 27 17:19:50 2009 From: kadlec at mail.kfki.hu (Kadlecsik Jozsef) Date: Fri, 27 Mar 2009 18:19:50 +0100 (CET) Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: References: <096401c9ae82$ac21d650$046582f0$@yarwood@juno.co.uk> Message-ID: Hi, Combing through the log files I found the following: Mar 27 13:31:56 lxserv0 fenced[3833]: web1-gfs not a cluster member after 0 sec post_fail_delay Mar 27 13:31:56 lxserv0 fenced[3833]: fencing node "web1-gfs" Mar 27 13:31:56 lxserv0 fenced[3833]: can't get node number for node e1??e1?? Mar 27 13:31:56 lxserv0 fenced[3833]: fence "web1-gfs" success The line saying "can't get node number for node e1??e1??" might be innocent, but looks suspicious. Why fenced could not get the victim name? Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 49, Hungary From s.wendy.cheng at gmail.com Fri Mar 27 18:51:06 2009 From: s.wendy.cheng at gmail.com (Wendy Cheng) Date: Fri, 27 Mar 2009 14:51:06 -0400 Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: <005401c9aeee$56f954c0$04ebfe40$@f2s.com> References: <096401c9ae82$ac21d650$046582f0$@yarwood@juno.co.uk> <005401c9aeee$56f954c0$04ebfe40$@f2s.com> Message-ID: <49CD201A.309@gmail.com> ... [snip] ... > Sigh. The pressure is mounting to > fix the cluster at any cost, and nothing remained but to downgrade to > cluster-2.01.00/openais-0.80.3 which would be just ridiculous. > > I have doubts that GFS (i.e. GFS1) is tuned and well-maintained on newer versions of RHCS (as well as 2.6 based kernels). My impression is that GFS1 is supposed to be phased out starting from RHEL 5. So if you are running with GFS1, why downgrading RHCS is ridiculous ? Should GFS2 be recommended ? Did you open a Red Hat support ticket ? Linux is free but Red Hat engineers still need to eat like any other human being. I have *not* looked at RHCS for more than a year now - so my impression (and opinion) may not be correct. -- Wendy From kadlec at mail.kfki.hu Fri Mar 27 19:02:11 2009 From: kadlec at mail.kfki.hu (Kadlecsik Jozsef) Date: Fri, 27 Mar 2009 20:02:11 +0100 (CET) Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: <49CD201A.309@gmail.com> References: <096401c9ae82$ac21d650$046582f0$@yarwood@juno.co.uk> <005401c9aeee$56f954c0$04ebfe40$@f2s.com> <49CD201A.309@gmail.com> Message-ID: On Fri, 27 Mar 2009, Wendy Cheng wrote: > ... [snip] ... > > Sigh. 
The pressure is mounting to fix the cluster at any cost, and nothing > > remained but to downgrade to > > cluster-2.01.00/openais-0.80.3 which would be just ridiculous. > > I have doubts that GFS (i.e. GFS1) is tuned and well-maintained on newer > versions of RHCS (as well as 2.6 based kernels). My impression is that > GFS1 is supposed to be phased out starting from RHEL 5. So if you are > running with GFS1, why downgrading RHCS is ridiculous ? We'd need features added to recent 2.6 kernels (like read-only bindmount), so the natural path was upgrading GFS1. However, as in the present state our cluster is unstable, either we have to find the culprit or go back to the proven version (and loosing the required new features). > Should GFS2 be recommended ? Did you open a Red Hat support ticket ? > Linux is free but Red Hat engineers still need to eat like any other > human being. We do not run RHEL5 but ubuntu hardy. GFS2 is not an option as it's officially still beta. And it's not possible to upgrade from GFS1 to GFS2 node-by-node as we did at moving from cluster-2.01.00 to cluster-2.03.11. Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 49, Hungary From rpeterso at redhat.com Fri Mar 27 19:35:52 2009 From: rpeterso at redhat.com (Bob Peterson) Date: Fri, 27 Mar 2009 15:35:52 -0400 (EDT) Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: <1011614586.1707211238182503739.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <519496029.1707231238182552406.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- "Kadlecsik Jozsef" wrote: | Hi, | | Combing through the log files I found the following: | | Mar 27 13:31:56 lxserv0 fenced[3833]: web1-gfs not a cluster member | after 0 sec post_fail_delay | Mar 27 13:31:56 lxserv0 fenced[3833]: fencing node "web1-gfs" | Mar 27 13:31:56 lxserv0 fenced[3833]: can't get node number for node | e1??e1?? | Mar 27 13:31:56 lxserv0 fenced[3833]: fence "web1-gfs" success | | The line saying "can't get node number for node e1??e1??" might be | innocent, but looks suspicious. Why fenced could not get the victim | name? | | Best regards, | Jozsef Hi This leads me to believe that this is a cluster problem, not a GFS problem. If a node is fenced, GFS can't give out new locks until the fenced node is properly deal with by the cluster software. Therefore, GFS can appear to hang until the dead node is resolved. Did web1-gfs get rebooted and brought back in to the cluster? Regards, Bob Peterson Red Hat GFS From kadlec at mail.kfki.hu Fri Mar 27 20:01:24 2009 From: kadlec at mail.kfki.hu (Kadlecsik Jozsef) Date: Fri, 27 Mar 2009 21:01:24 +0100 (CET) Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: <519496029.1707231238182552406.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> References: <519496029.1707231238182552406.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: On Fri, 27 Mar 2009, Bob Peterson wrote: > | Combing through the log files I found the following: > | > | Mar 27 13:31:56 lxserv0 fenced[3833]: web1-gfs not a cluster member > | after 0 sec post_fail_delay > | Mar 27 13:31:56 lxserv0 fenced[3833]: fencing node "web1-gfs" > | Mar 27 13:31:56 lxserv0 fenced[3833]: can't get node number for node > | e1??e1?? 
> | Mar 27 13:31:56 lxserv0 fenced[3833]: fence "web1-gfs" success > | > | The line saying "can't get node number for node e1??e1??" might be > | innocent, but looks suspicious. Why fenced could not get the victim > | name? > > This leads me to believe that this is a cluster problem, > not a GFS problem. If a node is fenced, GFS can't give out > new locks until the fenced node is properly deal with by > the cluster software. Therefore, GFS can appear to hang until > the dead node is resolved. Did web1-gfs get rebooted and > brought back in to the cluster? Yes. It's probably worth summarizing what's happening here: - A full, healthy-looking cluster with all of the five nodes joined runs smoothly. - One node freezes out of the blue; it can reliably be triggered anytime by starting mailman, which works over GFS. - The frozen node gets fenced off - I assume it's not the reverse, i.e. that the node freezes *because* it got fenced. As we use AOE, the fencing happens at the AOE level and the node is *not* rebooted automatically, but the access rights to the AOE devices are withdrawn. Freeze means there's no response at the console. The node still answers to ping, but nothing else. There's not a single error message in the kernel log or at the console screen. GFS does not freeze at all. There's a short pause, but then it works fine until the quorum is lost as more nodes fall out. We tried vanilla kernels 2.6.27.14 and 2.6.27.21 with the same results, so I don't think it's a kernel problem. It >looks< like either a GFS kernel module or an openais problem, if the latter (as the victim machine is fenced off) can cause a system freeze. In the daytime (active users) it was like an infection: within ten minutes of bringing back the machines one failed, then shortly after another too. Now, since 17:22 (more than three hours) the cluster runs smoothly, but it's lightly used. However, a node can be killed anytime by starting that damned mailman, which should run. Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 49, Hungary From rpeterso at redhat.com Fri Mar 27 20:11:17 2009 From: rpeterso at redhat.com (Bob Peterson) Date: Fri, 27 Mar 2009 16:11:17 -0400 (EDT) Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: Message-ID: <1404804625.1710261238184677530.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- "Kadlecsik Jozsef" wrote: | Yes. It's probably worth summarizing what's happening here: | | - A full, healthy-looking cluster with all of the five nodes joined | runs smoothly. | - One node freezes out of the blue; it can reliably be triggered | anytime by starting mailman, which works over GFS. | - The frozen node gets fenced off - I assume it's not the reverse, | i.e. that the node freezes *because* it got fenced. Hi, Perhaps you should change your post_fail_delay to some very high number, recreate the problem, and when it freezes force a sysrq-trigger to get call traces for all the processes. Then you can also look at dmesg to see if there was a kernel panic or something on the node that would otherwise be immediately fenced.
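As a rough illustration of the procedure Bob describes -- a minimal sketch only, where the delay value, the example fence_daemon line and the output handling are assumptions about a typical cluster.conf setup rather than anything taken from this thread:

# 1. Raise post_fail_delay (an attribute of the fence_daemon tag in
#    /etc/cluster/cluster.conf) so the failed node is not fenced right away,
#    e.g. <fence_daemon post_fail_delay="600" post_join_delay="3"/>,
#    then propagate the updated configuration to all nodes.
# 2. Make sure the magic SysRq key is enabled, reproduce the freeze, and
#    dump the call traces of all tasks to the console/netconsole:
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger
# 3. Check the kernel ring buffer for a panic or lockup message:
dmesg | tail -n 100

With a long post_fail_delay the frozen node is left unfenced long enough for the traces to be captured before fencing cuts it off.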
Regards, Bob Peterson Red Hat GFS From kadlec at mail.kfki.hu Fri Mar 27 23:59:34 2009 From: kadlec at mail.kfki.hu (Kadlecsik Jozsef) Date: Sat, 28 Mar 2009 00:59:34 +0100 (CET) Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: <1404804625.1710261238184677530.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> References: <1404804625.1710261238184677530.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: On Fri, 27 Mar 2009, Bob Peterson wrote: > Perhaps you should change your post_fail_delay to some very high > number, recreate the problem, and when it freezes force a > sysrq-trigger to get call traces for all the processes. > Then also you can look at the dmesg to see if there was a kernel > panic or something on the node that would otherwise be > immediately fenced. I enabled more kernel debugging, netconsole and captured the attaced console log. I hope it gives the required info. Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 49, Hungary -------------- next part -------------- ============================================= [ INFO: possible recursive locking detected ] 2.6.27.21 #6 --------------------------------------------- dlm_controld/3536 is trying to acquire lock: (&sb->s_type->i_mutex_key#8/2){--..}, at: [] configfs_attach_group+0x43/0x18f [configfs] but task is already holding lock: (&sb->s_type->i_mutex_key#8/2){--..}, at: [] configfs_attach_group+0x43/0x18f [configfs] other info that might help us debug this: 2 locks held by dlm_controld/3536: #0: (&sb->s_type->i_mutex_key#7/1){--..}, at: [] lookup_create+0x1b/0x85 #1: (&sb->s_type->i_mutex_key#8/2){--..}, at: [] configfs_attach_group+0x43/0x18f [configfs] stack backtrace: Pid: 3536, comm: dlm_controld Not tainted 2.6.27.21 #6 [] validate_chain+0x9bd/0xf35 [] __lock_acquire+0x27b/0x932 [] lock_acquire+0x5f/0x77 [] configfs_attach_group+0x43/0x18f [configfs] [] mutex_lock_nested+0xa1/0x25c [] configfs_attach_group+0x43/0x18f [configfs] [] configfs_attach_group+0x43/0x18f [configfs] [] configfs_attach_group+0x43/0x18f [configfs] [] configfs_attach_group+0xe5/0x18f [configfs] [] configfs_mkdir+0x1fb/0x36f [configfs] [] vfs_mkdir+0x9c/0xdf [] _spin_unlock+0x14/0x1c [] sys_mkdirat+0xc5/0xd2 [] trace_hardirqs_on_thunk+0xc/0x10 [] sys_mkdir+0x1f/0x23 [] sysenter_do_call+0x12/0x35 ======================= aoe: e1.0: setting 8704 byte data frames on eth1:0030486543a9 aoe: e2.0: setting 8704 byte data frames on eth1:003048654382 aoe: e2.0: setting 8704 byte data frames on eth1:003048654383 aoe: 0030486543a9 e1.0 vace0 has 2930382333 sectors aoe: 003048654382 e2.0 vace0 has 2930382333 sectors etherd/e1.0: unknown partition table etherd/e2.0: unknown partition table dlm: Using TCP for communications dlm: got connection from 1 dlm: got connection from 5 dlm: connecting to 2 dlm: got connection from 4 dlm: got connection from 2 Trying to join cluster "lock_dlm", "kfki:home" Joined cluster. Now mounting FS... GFS: fsid=kfki:home.3: jid=3: Trying to acquire journal lock... GFS: fsid=kfki:home.3: jid=3: Looking at journal... GFS: fsid=kfki:home.3: jid=3: Done Trying to join cluster "lock_dlm", "kfki:services" Joined cluster. Now mounting FS... GFS: fsid=kfki:services.3: jid=3: Trying to acquire journal lock... GFS: fsid=kfki:services.3: jid=3: Looking at journal... 
GFS: fsid=kfki:services.3: jid=3: Done GFS: fsid=kfki:home.3: fast statfs start time = 1238196765 GFS: fsid=kfki:services.3: fast statfs start time = 1238196770 SysRq : Show State task PC stack pid father init S f7061b14 0 1 0 f705f110 00000086 00000002 f7061b14 f7061b08 00000000 00000002 c045f0c0 c0461d00 c0461d00 c0461d00 f7061b18 f705f264 c373ed00 f6a8dd00 00000000 ffff9d54 00000000 00000246 c0318668 ffffffff 00000002 00000000 00000000 Call Trace: [] mutex_lock_nested+0x18c/0x25c [] schedule_timeout+0x69/0xa1 [] inotify_poll+0x45/0x4b [] do_select+0x3ef/0x4cb [] do_select+0x14/0x4cb [] __pollwait+0x0/0xba [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] net_rx_action+0x85/0x1db [] net_rx_action+0x9d/0x1db [] __do_softirq+0xa2/0xf9 [] number+0x25b/0x264 [] restore_nocheck_notrace+0x0/0xe [] lockdep_stats_show+0x299/0x4f0 [] param_set_short+0x9/0x36 [] number+0x2/0x264 [] vsnprintf+0x2dc/0x581 [] validate_chain+0x395/0xf35 [] core_sys_select+0x1f3/0x335 [] core_sys_select+0x22/0x335 [] _spin_lock_irqsave+0x41/0x49 [] do_wait+0x31e/0x342 [] sys_select+0x3f/0x19e [] sysenter_do_call+0x12/0x35 ======================= kthreadd S f7063fac 0 2 0 f705e8a0 00000086 00000002 f7063fac f7063fa0 00000000 f705e8a0 c045f0c0 c0461d00 c0461d00 c0461d00 f7063fb0 f705e9f4 c373ed00 f7896d00 00000246 ffff989b 00000000 00000000 00000000 ffffffff 00000246 00000000 00000000 Call Trace: [] kthreadd+0x14a/0x14f [] kthreadd+0x0/0x14f [] kernel_thread_helper+0x7/0x1c ======================= migration/0 R running 0 3 2 f705e030 00000082 00000086 00000002 00000000 c3734d10 f705e030 c045f0c0 c0461d00 c0461d00 c0461d00 00000046 f705e184 c3734d00 f7967800 00000046 00000000 00000002 00000001 00000000 c011cefe 00000046 00000046 c3734d00 Call Trace: [] migration_thread+0x50/0x2cc [] migration_thread+0x147/0x2cc [] migration_thread+0x0/0x2cc [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= ksoftirqd/0 S f706bfa8 0 4 2 f7069190 00000092 00000002 f706bfa8 f706bf9c 00000000 00000246 c045f0c0 c0461d00 c0461d00 c0461d00 f706bfac f70692e4 c3734d00 f7801800 c040caa0 ffffef69 00000000 c045f100 c0124797 ffffffff c0461b80 00000000 00000000 Call Trace: [] __do_softirq+0xa2/0xf9 [] ksoftirqd+0xee/0xf3 [] ksoftirqd+0x0/0xf3 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= migration/1 S 00000002 0 5 2 f7068920 00000082 00000086 00000002 00000000 c373ed10 f7068920 c045f0c0 c0461d00 c0461d00 c0461d00 00000046 f7068a74 c373ed00 f64a8580 00000046 00000000 00000002 00000001 00000000 c011cefe 00000046 00000046 c373ed00 Call Trace: [] migration_thread+0x50/0x2cc [] migration_thread+0x147/0x2cc [] migration_thread+0x0/0x2cc [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= ksoftirqd/1 S f7071fa8 0 6 2 f70680b0 00000092 00000002 f7071fa8 f7071f9c 00000000 00000246 c045f0c0 c0461d00 c0461d00 c0461d00 f7071fac f7068204 c373ed00 f7c60800 c040caa0 ffffef7b 00000000 c045f100 c0124797 ffffffff c0461b80 00000000 00000000 Call Trace: [] __do_softirq+0xa2/0xf9 [] ksoftirqd+0xee/0xf3 [] ksoftirqd+0x0/0xf3 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= events/0 R running 0 7 2 f70729a0 00000082 00000002 f7079f94 f7079f88 00000000 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 f7079f98 f7072af4 c3734d00 f78a5080 00000286 ffffef43 00000000 c0319cc1 00000000 ffffffff 00000046 00000000 00000000 Call Trace: [] 
_spin_lock_irqsave+0x41/0x49 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= events/1 R running 0 8 2 f7072130 00000082 00013362 c373baa0 00000002 f707bf90 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 f707bf98 f7072284 c373ed00 f7c60800 00000286 ffffef7e 00000000 c0319cc1 00000000 ffffffff 00000046 00000000 00000000 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= khelper S 00000001 0 9 2 f707d290 00000082 f707d610 00000001 00000002 00000001 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f707d3e4 c3734d00 f781a080 00000286 00000001 f702b39c c0319cc1 00000000 f707d290 c0319d8f c012ee03 c013d34e Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 [] worker_thread+0x0/0x89 [] trace_hardirqs_on_caller+0x9d/0x113 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= kblockd/0 S 00000000 0 47 2 f707ca20 00000082 00000046 00000000 00000002 00000001 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f707cb74 c3734d00 f7c60300 00000286 00000001 f70a651c c0319cc1 00000000 00000046 00000046 f70a6524 00000286 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= kblockd/1 S f70cbf94 0 48 2 f707c1b0 00000082 00000002 f70cbf94 f70cbf88 00000000 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 f70cbf98 f707c304 c373ed00 f781a300 00000286 ffff9376 00000000 c0319cc1 00000000 ffffffff c0319d8f 00000000 00000000 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= kacpid S f70cff94 0 49 2 f7083390 00000082 00000002 f70cff94 f70cff88 00000000 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 f70cff98 f70834e4 c373ed00 c03e81a0 00000286 ffff8af2 00000000 c0319cc1 00000000 ffffffff c0319d8f 00000000 00000000 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= kacpi_notify S f70d1f94 0 50 2 f7082b20 00000082 00000002 f70d1f94 f70d1f88 00000000 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 f70d1f98 f7082c74 c373ed00 c03e81a0 00000286 ffff8af2 00000000 c0319cc1 00000000 ffffffff c0319d8f 00000000 00000000 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= ata/0 S 00000001 0 130 2 f7120da0 00000082 f7121120 00000001 00000002 00000000 00000046 
c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f7120ef4 c3734d00 c03e81a0 00000286 00000001 f710dd1c c0319cc1 00000000 f7120da0 c0319d8f c012ee03 c013d34e Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 [] worker_thread+0x0/0x89 [] trace_hardirqs_on_caller+0x9d/0x113 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= ata/1 S f7125f94 0 131 2 f7121610 00000082 00000002 f7125f94 f7125f88 00000000 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 f7125f98 f7121764 c373ed00 c03e81a0 00000286 ffff8af2 00000000 c0319cc1 00000000 ffffffff c0319d8f 00000000 00000000 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= ata_aux S f712df94 0 132 2 f711a4b0 00000082 00000002 f712df94 f712df88 00000000 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 f712df98 f711a604 c3734d00 c03e81a0 00000286 ffff8af2 00000000 c0319cc1 00000000 ffffffff c0319d8f 00000000 00000000 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= ksuspend_usbd S 00000001 0 133 2 f711ad20 00000082 f711b0a0 00000001 00000002 00000000 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f711ae74 c3734d00 c03e81a0 00000286 00000001 f710da1c c0319cc1 00000000 f711ad20 c0319d8f c012ee03 c013d34e Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 [] worker_thread+0x0/0x89 [] trace_hardirqs_on_caller+0x9d/0x113 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= khubd S c373baa0 0 139 2 f711d690 00000082 00000e14 c373baa0 00000002 f7153f0c 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 f7153f14 f711d7e4 c373ed00 f6b40300 00000286 ffff8ceb 00000000 c0319cc1 00000000 ffffffff c0319d8f 00000000 00000000 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 [] hub_thread+0x87f/0xcef [] trace_hardirqs_on_caller+0x9d/0x113 [] _spin_unlock_irq+0x20/0x23 [] finish_task_switch+0x58/0xab [] schedule+0x228/0x708 [] autoremove_wake_function+0x0/0x35 [] hub_thread+0x0/0xcef [] trace_hardirqs_on_caller+0x9d/0x113 [] hub_thread+0x0/0xcef [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= kseriod S c3731aa0 0 142 2 f711ce20 00000082 00000400 c3731aa0 00000002 f7157f80 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 f7157f88 f711cf74 c3734d00 f781a580 00000286 ffff8ca9 00000000 c0319cc1 00000000 ffffffff c0319d8f 00000000 00000000 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 [] serio_thread+0xa4/0x2c6 [] autoremove_wake_function+0x0/0x35 [] serio_thread+0x0/0x2c6 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= pdflush S 00000002 0 181 2 f7158920 00000082 00000092 00000002 00000000 c03edd8c f7158920 c045f0c0 c0461d00 c0461d00 c0461d00 00000046 f7158a74 
c3734d00 c03e81a0 00000046 00000000 00000002 00000001 00000000 c0154020 f7158920 c0319c5a c0153fc2 Call Trace: [] pdflush+0x5e/0x19c [] _spin_unlock_irq+0x20/0x23 [] pdflush+0x0/0x19c [] trace_hardirqs_on_caller+0x9d/0x113 [] pdflush+0x0/0x19c [] pdflush+0xb2/0x19c [] trace_hardirqs_on_caller+0x9d/0x113 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= pdflush R running 0 182 2 f7159190 00000082 00000002 f71c3f94 f71c3f88 00000000 f7159190 c045f0c0 c0461d00 c0461d00 c0461d00 f71c3f98 f71592e4 c3734d00 f7c60300 00000046 ffffee63 00000000 00000001 00000000 ffffffff 00000046 00000000 00000000 Call Trace: [] pdflush+0x0/0x19c [] pdflush+0xb2/0x19c [] wb_kupdate+0x0/0xe3 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= kswapd0 S f71e7f28 0 183 2 f7130030 00000096 00000002 f71e7f28 f71e7f1c 00000000 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 f71e7f2c f7130184 c373ed00 c03e81a0 00000286 ffff8af2 00000000 c0319cc1 00000000 ffffffff c0319d8f 00000000 00000000 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 [] kswapd+0x3fd/0x420 [] schedule+0x228/0x708 [] autoremove_wake_function+0x0/0x35 [] _spin_unlock_irqrestore+0x34/0x39 [] kswapd+0x0/0x420 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= aio/0 S 00000001 0 184 2 f71308a0 00000082 f7130c20 00000001 00000002 00000000 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f71309f4 c3734d00 c03e81a0 00000286 00000001 f730dc1c c0319cc1 00000000 f71308a0 c0319d8f c012ee03 c013d34e Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 [] worker_thread+0x0/0x89 [] trace_hardirqs_on_caller+0x9d/0x113 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= aio/1 S f732ff94 0 185 2 f7131110 00000082 00000002 f732ff94 f732ff88 00000000 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 f732ff98 f7131264 c373ed00 c03e81a0 00000286 ffff8af2 00000000 c0319cc1 00000000 ffffffff c0319d8f 00000000 00000000 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= kpsmoused S f6addf94 0 335 2 f6aa8a20 00000082 00000002 f6addf94 f6addf88 00000000 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 f6addf98 f6aa8b74 c373ed00 c03e81a0 00000286 ffff8bde 00000000 c0319cc1 00000000 ffffffff c0319d8f 00000000 00000000 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= kstriped S f6b03f94 0 339 2 f6a780b0 00000082 00000002 f6b03f94 f6b03f88 00000000 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 f6b03f98 f6a78204 c373ed00 c03e81a0 00000286 ffff8be4 00000000 c0319cc1 00000000 ffffffff c0319d8f 00000000 00000000 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] 
kernel_thread_helper+0x7/0x1c ======================= edac-poller S f6b0df94 0 343 2 f6aa09a0 00000082 00000002 f6b0df94 f6b0df88 00000000 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 f6b0df98 f6aa0af4 c3734d00 c03e81a0 00000286 ffff8be3 00000000 c0319cc1 00000000 ffffffff c0319d8f 00000000 00000000 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= scsi_eh_0 S f79f7f68 0 1224 2 f7999410 00000096 00000002 f79f7f68 f79f7f5c 00000000 00000000 c045f0c0 c0461d00 c0461d00 c0461d00 f79f7f6c f7999564 c3734d00 f7903a80 f780f840 ffff8d11 00000000 f780f800 00000000 ffffffff 00000286 00000000 00000000 Call Trace: [] scsi_error_handler+0x0/0x2de [] scsi_error_handler+0x4b/0x2de [] __wake_up_common+0x46/0x68 [] _spin_unlock_irqrestore+0x34/0x39 [] scsi_error_handler+0x0/0x2de [] trace_hardirqs_on_caller+0x9d/0x113 [] scsi_error_handler+0x0/0x2de [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= scsi_eh_1 S f7c4bf68 0 1225 2 f7a0a330 00000096 00000002 f7c4bf68 f7c4bf5c 00000000 00000000 c045f0c0 c0461d00 c0461d00 c0461d00 f7c4bf6c f7a0a484 c3734d00 f78a5800 f7c53840 ffff8d7c 00000000 f7c53800 00000000 ffffffff 00000286 00000000 00000000 Call Trace: [] scsi_error_handler+0x0/0x2de [] scsi_error_handler+0x4b/0x2de [] __wake_up_common+0x46/0x68 [] _spin_unlock_irqrestore+0x34/0x39 [] scsi_error_handler+0x0/0x2de [] trace_hardirqs_on_caller+0x9d/0x113 [] scsi_error_handler+0x0/0x2de [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= scsi_eh_2 S f7c53000 0 1226 2 f7a1b510 00000096 f7c53050 f7c53000 00000000 00000046 00000000 c045f0c0 c0461d00 c0461d00 c0461d00 f7a1b510 f7a1b664 c373ed00 f7982800 f7c53040 00000286 00000286 f7c53000 00000000 c025645c 00000286 00000282 f7c53000 Call Trace: [] __scsi_iterate_devices+0x49/0x62 [] scsi_error_handler+0x0/0x2de [] scsi_error_handler+0x0/0x2de [] scsi_error_handler+0x4b/0x2de [] __wake_up_common+0x46/0x68 [] _spin_unlock_irqrestore+0x34/0x39 [] scsi_error_handler+0x0/0x2de [] trace_hardirqs_on_caller+0x9d/0x113 [] scsi_error_handler+0x0/0x2de [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= scsi_eh_3 S f78fd800 0 1227 2 f6a79190 00000096 f78fd850 f78fd800 00000000 00000046 00000000 c045f0c0 c0461d00 c0461d00 c0461d00 f6a79190 f6a792e4 c373ed00 f6b40300 f78fd840 00000286 00000286 f78fd800 00000000 c025645c 00000286 00000282 f78fd800 Call Trace: [] __scsi_iterate_devices+0x49/0x62 [] scsi_error_handler+0x0/0x2de [] scsi_error_handler+0x0/0x2de [] scsi_error_handler+0x4b/0x2de [] __wake_up_common+0x46/0x68 [] _spin_unlock_irqrestore+0x34/0x39 [] scsi_error_handler+0x0/0x2de [] trace_hardirqs_on_caller+0x9d/0x113 [] scsi_error_handler+0x0/0x2de [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= md2_raid1 S f7c6ff68 0 1391 2 f7a96b20 00000096 00000002 f7c6ff68 f7c6ff5c 00000000 f29c2230 c045f0c0 c0461d00 c0461d00 c0461d00 f7c6ff6c f7a96c74 c3734d00 f7967080 c01f4716 ffffef43 00000000 f79d7a18 00000001 ffffffff 00000046 00000000 00000000 Call Trace: [] __delay+0x6/0x7 [] md_thread+0x0/0xda [] schedule_timeout+0x69/0xa1 [] md_thread+0xb0/0xda [] autoremove_wake_function+0x0/0x35 [] md_thread+0x0/0xda [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] 
kernel_thread_helper+0x7/0x1c ======================= md2_resync R running 0 1392 2 f7a74130 00000086 00000046 00000000 00000002 00000001 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f7a74284 c3734d00 f781a300 c04b9e00 00000286 c04b9e00 00000286 00000286 f7d57e28 c04b9e00 c0128a5a 00000000 Call Trace: [] __mod_timer+0x94/0xa3 [] schedule_timeout+0x44/0xa1 [] delay_tsc+0x2a/0x47 [] process_timeout+0x0/0x5 [] msleep_interruptible+0x22/0x2f [] sync_request+0x5c7/0x795 [] is_mddev_idle+0x103/0x10f [] is_mddev_idle+0x9/0x10f [] sync_request+0x0/0x795 [] md_do_sync+0x320/0xdb7 [] validate_chain+0x395/0xf35 [] __lock_acquire+0x27b/0x932 [] _spin_unlock_irq+0x20/0x23 [] md_thread+0x0/0xda [] trace_hardirqs_on_caller+0x9d/0x113 [] md_thread+0x0/0xda [] md_thread+0x22/0xda [] _spin_unlock_irqrestore+0x34/0x39 [] md_thread+0x0/0xda [] trace_hardirqs_on_caller+0x9d/0x113 [] md_thread+0x0/0xda [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= md1_raid1 S f7445f68 0 1401 2 f79b1510 00000096 00000002 f7445f68 f7445f5c 00000000 f7073210 c045f0c0 c0461d00 c0461d00 c0461d00 f7445f6c f79b1664 c373ed00 f64d7d00 c01f4716 ffffd981 00000000 f7463418 00000001 ffffffff 00000046 00000000 00000000 Call Trace: [] __delay+0x6/0x7 [] md_thread+0x0/0xda [] schedule_timeout+0x69/0xa1 [] md_thread+0xb0/0xda [] autoremove_wake_function+0x0/0x35 [] md_thread+0x0/0xda [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= md0_raid1 S f7957f68 0 1409 2 f79f21b0 00000096 00000002 f7957f68 f7957f5c 00000000 f7073210 c045f0c0 c0461d00 c0461d00 c0461d00 f7957f6c f79f2304 c373ed00 f64d7d00 c01f4716 ffffd981 00000000 f7490d98 00000001 ffffffff 00000046 00000000 00000000 Call Trace: [] __delay+0x6/0x7 [] md_thread+0x0/0xda [] schedule_timeout+0x69/0xa1 [] md_thread+0xb0/0xda [] autoremove_wake_function+0x0/0x35 [] md_thread+0x0/0xda [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= kdmflush S 00000001 0 1420 2 f79d0630 00000082 f79d09b0 00000001 00000002 00000001 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f79d0784 c373ed00 f7801d00 00000286 00000001 f788629c c0319cc1 00000000 f79d0630 c0319d8f c012ee03 c013d34e Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 [] worker_thread+0x0/0x89 [] trace_hardirqs_on_caller+0x9d/0x113 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= kdmflush S 00000001 0 1426 2 f71d0aa0 00000082 f71d0e20 00000001 00000002 00000001 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f71d0bf4 c373ed00 f7801d00 00000286 00000001 f7c5e51c c0319cc1 00000000 f71d0aa0 c0319d8f c012ee03 c013d34e Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 [] worker_thread+0x0/0x89 [] trace_hardirqs_on_caller+0x9d/0x113 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= kdmflush S 00000001 0 1432 2 f7a75210 00000082 f7a75590 00000001 00000002 00000001 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f7a75364 c373ed00 f6b40a80 00000286 00000001 f74cf79c c0319cc1 00000000 f7a75210 c0319d8f c012ee03 c013d34e Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 
[] worker_thread+0x0/0x89 [] trace_hardirqs_on_caller+0x9d/0x113 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= kjournald R running 0 1514 2 f7a264b0 00000082 00000046 f78c84d8 00000001 f78c8510 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f7a26604 c3734d00 f7967080 00000286 00000001 f78c8510 c0319cc1 00000000 00000046 00000246 00000246 f78c8414 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] kjournald+0x1bc/0x1c5 [] autoremove_wake_function+0x0/0x35 [] kjournald+0x0/0x1c5 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= udevd S f759db14 0 1675 1 f7a43110 00000086 00000002 f759db14 f759db08 00000000 00000000 c045f0c0 c0461d00 c0461d00 c0461d00 f759db18 f7a43264 c373ed00 f7896d00 f78d8800 ffff989b 00000000 00000000 00000002 ffffffff f78d8800 00000000 00000000 Call Trace: [] schedule_timeout+0x69/0xa1 [] pipe_poll+0x24/0x86 [] do_select+0x3ef/0x4cb [] do_select+0x14/0x4cb [] __pollwait+0x0/0xba [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] vsnprintf+0x2dc/0x581 [] number+0x25b/0x264 [] add_to_page_cache_locked+0x46/0x8f [] vsnprintf+0x2dc/0x581 [] _spin_lock+0x31/0x3c [] handle_mm_fault+0x4ea/0x8a4 [] core_sys_select+0x1f3/0x335 [] core_sys_select+0x22/0x335 [] _spin_unlock+0x14/0x1c [] do_wp_page+0x2a2/0x537 [] _spin_lock_irqsave+0x41/0x49 [] do_wait+0x1e7/0x342 [] do_page_fault+0xa5/0x848 [] sys_select+0x3f/0x19e [] sysenter_do_call+0x12/0x35 ======================= gfs2_scand R running 0 2420 2 f7a83290 00000092 00000046 00000000 00000002 00000001 f706e000 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f7a833e4 c373ed00 f64a8a80 f706e000 00000282 f706e000 00000282 00000282 f75cbfac f706e000 c0128a5a 00000000 Call Trace: [] __mod_timer+0x94/0xa3 [] gfs2_scand+0x0/0x63 [gfs2] [] schedule_timeout+0x44/0xa1 [] _read_unlock+0x14/0x1c [] process_timeout+0x0/0x5 [] gfs2_scand+0x56/0x63 [gfs2] [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= glock_workque S c3731aa0 0 2422 2 f7a82a20 00000082 00000afb c3731aa0 00000002 f7585f90 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 f7585f98 f7a82b74 c3734d00 f64d7d00 00000286 ffff9049 00000000 c0319cc1 00000000 ffffffff c0319d8f 00000000 00000000 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= glock_workque S 00000001 0 2423 2 f7120530 00000082 f71208b0 00000001 00000002 00000000 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f7120684 c373ed00 f78ff080 00000286 00000001 f501be9c c0319cc1 00000000 f7120530 c0319d8f c012ee03 c013d34e Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 [] worker_thread+0x0/0x89 [] trace_hardirqs_on_caller+0x9d/0x113 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= ksnapd S 00000001 0 2536 2 f79db110 00000082 f79db490 00000001 00000002 00000000 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f79db264 c373ed00 f7963a80 00000286 
00000001 f501b89c c0319cc1 00000000 f79db110 c0319d8f c012ee03 c013d34e Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 [] worker_thread+0x0/0x89 [] trace_hardirqs_on_caller+0x9d/0x113 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= kjournald S f6533f84 0 2623 2 f6a48ca0 00000082 00000002 f6533f84 f6533f78 00000000 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 f6533f88 f6a48df4 c3734d00 f64a8a80 f6a48ca0 ffff9450 00000000 f5065d10 c013d220 ffffffff 00000246 00000000 00000000 Call Trace: [] mark_held_locks+0x62/0x73 [] kjournald+0x1bc/0x1c5 [] autoremove_wake_function+0x0/0x35 [] kjournald+0x0/0x1c5 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= kjournald S f50b7f84 0 2624 2 f6a70030 00000082 00000002 f50b7f84 f50b7f78 00000000 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 f50b7f88 f6a70184 c3734d00 f7c60a80 00000286 ffffeba1 00000000 c0319cc1 00000000 ffffffff 00000246 00000000 00000000 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] kjournald+0x1bc/0x1c5 [] autoremove_wake_function+0x0/0x35 [] kjournald+0x0/0x1c5 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= kjournald R running 0 2625 2 f71e1410 00000082 00000046 f506bcd8 00000001 f506bd10 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f71e1564 c3734d00 f7967080 00000286 00000001 f506bd10 c0319cc1 00000000 00000046 00000246 00000246 f506bc14 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] kjournald+0x1bc/0x1c5 [] autoremove_wake_function+0x0/0x35 [] kjournald+0x0/0x1c5 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= getty S f50e3eb4 0 2962 1 f6ab8b20 00000082 00000002 f50e3eb4 f50e3ea8 00000000 00000000 c045f0c0 c0461d00 c0461d00 c0461d00 f50e3eb8 f6ab8c74 c373ed00 f78a5300 00000000 ffff9238 00000000 00000046 f78e797c ffffffff f7538000 00000000 00000000 Call Trace: [] schedule_timeout+0x69/0xa1 [] mark_held_locks+0x62/0x73 [] _spin_unlock_irqrestore+0x34/0x39 [] trace_hardirqs_on_caller+0x9d/0x113 [] read_chan+0x1e1/0x640 [] default_wake_function+0x0/0x8 [] tty_read+0x70/0x9d [] read_chan+0x0/0x640 [] tty_read+0x0/0x9d [] vfs_read+0x85/0x11b [] sys_read+0x41/0x6a [] sysenter_do_call+0x12/0x35 ======================= getty S 0038a000 0 2963 1 f7a6b190 00000082 00000001 0038a000 00000000 c013eb96 00000000 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f7a6b2e4 c373ed00 f7982800 00000000 f7a6b190 c03182b9 00000046 f65b897c f65b8800 f744d000 00000046 00000000 Call Trace: [] __lock_acquire+0x27b/0x932 [] mutex_lock_interruptible_nested+0x192/0x2b7 [] schedule_timeout+0x69/0xa1 [] mark_held_locks+0x62/0x73 [] _spin_unlock_irqrestore+0x34/0x39 [] trace_hardirqs_on_caller+0x9d/0x113 [] read_chan+0x1e1/0x640 [] default_wake_function+0x0/0x8 [] tty_read+0x70/0x9d [] read_chan+0x0/0x640 [] tty_read+0x0/0x9d [] vfs_read+0x85/0x11b [] sys_read+0x41/0x6a [] sysenter_do_call+0x12/0x35 ======================= login S 00000002 0 2965 1 f7a2eda0 00000082 f7a2eda0 00000002 00000000 c040c990 f7a2eda0 c045f0c0 c0461d00 c0461d00 c0461d00 00000246 f7a2eef4 c3734d00 f64d7d00 00000000 00000002 00000002 00000000 00000000 c01228e1 00000246 00000246 c040c980 Call Trace: [] do_wait+0x96/0x342 [] do_wait+0x28c/0x342 [] default_wake_function+0x0/0x8 [] sys_wait4+0x69/0xa4 [] sysenter_do_call+0x12/0x35 
======================= getty S 0038a000 0 2966 1 f7a26d20 00000082 00000001 0038a000 00000000 c013eb96 00000000 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f7a26e74 c3734d00 f7967d00 00000000 f7a26d20 c03182b9 00000046 f748897c f7488800 f6461000 00000046 00000000 Call Trace: [] __lock_acquire+0x27b/0x932 [] mutex_lock_interruptible_nested+0x192/0x2b7 [] schedule_timeout+0x69/0xa1 [] mark_held_locks+0x62/0x73 [] _spin_unlock_irqrestore+0x34/0x39 [] trace_hardirqs_on_caller+0x9d/0x113 [] read_chan+0x1e1/0x640 [] default_wake_function+0x0/0x8 [] tty_read+0x70/0x9d [] read_chan+0x0/0x640 [] tty_read+0x0/0x9d [] vfs_read+0x85/0x11b [] sys_read+0x41/0x6a [] sysenter_do_call+0x12/0x35 ======================= login S 00000002 0 2969 1 f7a02b20 00000082 f7a02b20 00000002 00000000 c040c990 f7a02b20 c045f0c0 c0461d00 c0461d00 c0461d00 00000246 f7a02c74 c3734d00 f7801300 00000000 00000002 00000002 00000000 00000000 c01228e1 00000246 00000246 c040c980 Call Trace: [] do_wait+0x96/0x342 [] do_wait+0x28c/0x342 [] default_wake_function+0x0/0x8 [] sys_wait4+0x69/0xa4 [] sysenter_do_call+0x12/0x35 ======================= getty S 0038a000 0 2970 1 f78e45b0 00000082 00000001 0038a000 00000000 c013eb96 00000000 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f78e4704 c3734d00 f6505d00 00000000 f78e45b0 c03182b9 00000046 f7c0b17c f7c0b000 f6460000 00000046 00000000 Call Trace: [] __lock_acquire+0x27b/0x932 [] mutex_lock_interruptible_nested+0x192/0x2b7 [] schedule_timeout+0x69/0xa1 [] mark_held_locks+0x62/0x73 [] _spin_unlock_irqrestore+0x34/0x39 [] trace_hardirqs_on_caller+0x9d/0x113 [] read_chan+0x1e1/0x640 [] default_wake_function+0x0/0x8 [] tty_read+0x70/0x9d [] read_chan+0x0/0x640 [] tty_read+0x0/0x9d [] vfs_read+0x85/0x11b [] sys_read+0x41/0x6a [] sysenter_do_call+0x12/0x35 ======================= syslogd R running 0 3016 1 f71d8b20 00000086 f75be31c f71d8b20 00000002 00000000 f6d175ac c045f0c0 c0461d00 c0461d00 c0461d00 00000020 f71d8c74 c3734d00 f7896300 00000000 00000000 c030b41c 00000246 00000246 f75be30c f75be30c 00000000 00000020 Call Trace: [] unix_peer_get+0x11/0x2b [] _spin_unlock+0x14/0x1c [] schedule_timeout+0x69/0xa1 [] do_select+0x3ef/0x4cb [] do_select+0x14/0x4cb [] __pollwait+0x0/0xba [] default_wake_function+0x0/0x8 [] __ext3_get_inode_loc+0xba/0x307 [] __ext3_journal_dirty_metadata+0x16/0x3b [] _spin_lock_irqsave+0x41/0x49 [] journal_stop+0x134/0x1c3 [] ext3_ordered_write_end+0xf0/0x14a [] generic_file_buffered_write+0x1a6/0x677 [] mnt_drop_write+0x1d/0xd8 [] _spin_lock+0x31/0x3c [] pipe_write+0x5c/0x450 [] core_sys_select+0x1f3/0x335 [] core_sys_select+0x22/0x335 [] pipe_write+0x0/0x450 [] do_sync_readv_writev+0xcb/0x107 [] mntput_no_expire+0x18/0xef [] autoremove_wake_function+0x0/0x35 [] copy_from_user+0x2d/0x59 [] rw_copy_check_uvector+0x64/0xf4 [] sys_select+0x3f/0x19e [] sigprocmask+0x68/0xc3 [] sys_rt_sigprocmask+0xbf/0xd4 [] sysenter_do_call+0x12/0x35 ======================= dd S f6489f04 0 3049 1 f6a605b0 00000086 00000002 f6489f04 f6489ef8 00000000 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 f6489f08 f6a60704 c373ed00 f78ff580 00000282 ffff93bc 00000000 c0319cc1 00000000 ffffffff c0319d8f 00000000 00000000 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 [] do_syslog+0x278/0x399 [] autoremove_wake_function+0x0/0x35 [] kmsg_read+0x0/0x36 [] proc_reg_read+0x58/0x79 [] proc_reg_read+0x0/0x79 [] vfs_read+0x85/0x11b [] sys_read+0x41/0x6a [] sysenter_do_call+0x12/0x35 ======================= klogd S 00000000 0 3050 1 f6aa9290 00000092 f6aa9628 00000000 
f6aa9610 00000001 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f6aa93e4 c373ed00 f7967580 f6aa9290 00000000 00000000 f6aa9290 c0318467 f75e8fd8 c013d34e f75e8fd4 f75e8ff4 Call Trace: [] __mutex_unlock_slowpath+0x89/0xf9 [] trace_hardirqs_on_caller+0x9d/0x113 [] pipe_wait+0x50/0x72 [] autoremove_wake_function+0x0/0x35 [] pipe_read+0xe3/0x3fe [] do_sync_read+0xd2/0x10e [] autoremove_wake_function+0x0/0x35 [] dnotify_parent+0x1f/0x5f [] do_sync_read+0x0/0x10e [] vfs_read+0x85/0x11b [] sys_read+0x41/0x6a [] sysenter_do_call+0x12/0x35 ======================= mysqld_safe S f5dcff24 0 3150 1 f6ab9390 00000082 00000002 f5dcff24 f5dcff18 00000000 f6ab9390 c045f0c0 c0461d00 c0461d00 c0461d00 f5dcff28 f6ab94e4 c3734d00 f7832a80 00000000 ffff9273 00000000 00000000 00000000 ffffffff 00000246 00000000 00000000 Call Trace: [] do_wait+0x28c/0x342 [] default_wake_function+0x0/0x8 [] sys_wait4+0x69/0xa4 [] sysenter_do_call+0x12/0x35 ======================= mysqld S f50dfb14 0 3192 3150 f6aa1210 00000086 00000002 f50dfb14 00000002 00000000 f540132c c045f0c0 c0461d00 c0461d00 c0461d00 f50dfb18 f6aa1364 c3734d00 f781a300 00000000 00000046 00000000 00000002 00000001 f540131c 00000282 f7854280 00000000 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] schedule_timeout+0x69/0xa1 [] unix_poll+0x1a/0xab [] do_select+0x3ef/0x4cb [] do_select+0x14/0x4cb [] __pollwait+0x0/0xba [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] __getblk+0x36/0x2a7 [] validate_chain+0x395/0xf35 [] __lock_acquire+0x27b/0x932 [] ext3_getblk+0xb9/0x1c5 [] ll_rw_block+0x42/0x10e [] release_sock+0x24/0xac [] _spin_lock_bh+0x36/0x41 [] ip_setsockopt+0x113/0xbad [] local_bh_enable_ip+0x79/0xb5 [] ip_setsockopt+0x113/0xbad [] core_sys_select+0x1f3/0x335 [] core_sys_select+0x22/0x335 [] enqueue_task_fair+0xa2/0xb4 [] wake_up_new_task+0x87/0xa0 [] do_fork+0xce/0x247 [] sys_select+0x3f/0x19e [] sysenter_do_call+0x12/0x35 ======================= mysqld S f5de7df8 0 3200 3150 f7a42030 00000096 00000002 f5de7df8 f5de7dec 00000000 f7a42030 c045f0c0 c0461d00 c0461d00 c0461d00 f5de7dfc f7a42184 c373ed00 f781a300 f5de7e3c ffff9298 00000000 08e3d9f8 c0319cc1 ffffffff c0319d8f 00000000 00000000 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 [] futex_wait+0x318/0x396 [] __lock_acquire+0x27b/0x932 [] __lock_acquire+0x27b/0x932 [] default_wake_function+0x0/0x8 [] do_futex+0xbc/0x951 [] enqueue_task_fair+0xa2/0xb4 [] _spin_unlock_irqrestore+0x34/0x39 [] trace_hardirqs_on_caller+0x9d/0x113 [] rwsem_wake+0x4d/0x116 [] call_rwsem_wake+0xa/0xc [] up_read+0x26/0x2a [] sys_futex+0x86/0x111 [] trace_hardirqs_on_thunk+0xc/0x10 [] trace_hardirqs_on_caller+0x9d/0x113 [] sysenter_do_call+0x12/0x35 ======================= mysqld S 00000002 0 3201 3150 f7a27590 00000096 f7a27590 00000002 00000000 c082d374 f7a27590 c045f0c0 c0461d00 c0461d00 c0461d00 00000046 f7a276e4 c3734d00 f781a300 f5081e3c 00000282 00000000 08e3da68 c0319cc1 00000046 00000046 f5081e3c 00000282 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] futex_wait+0x318/0x396 [] try_to_wake_up+0xa8/0x137 [] _spin_unlock+0x14/0x1c [] futex_requeue+0xdb/0x251 [] default_wake_function+0x0/0x8 [] do_futex+0xbc/0x951 [] find_get_pages_tag+0x11/0x140 [] ext3_write_inode+0x0/0x34 [] mapping_tagged+0x58/0x60 [] pagevec_lookup_tag+0x24/0x2c [] wait_on_page_writeback_range+0x5f/0x104 [] _spin_unlock+0x14/0x1c [] sync_inode+0x25/0x2a [] sys_futex+0x86/0x111 [] do_fsync+0x73/0x92 [] sysenter_do_call+0x12/0x35 ======================= mysqld S f79e8d20 0 3202 3150 f79e89a0 
00000096 f79e89a0 f79e8d20 00000001 c082c6e8 f79e89a0 c045f0c0 c0461d00 c0461d00 c0461d00 00000046 f79e8af4 c3734d00 f781a300 f75dde3c 00000282 00000000 08e3dad8 c0319cc1 f79e89a0 c0319d8f 00000000 c013d34e Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 [] trace_hardirqs_on_caller+0x9d/0x113 [] futex_wait+0x318/0x396 [] __alloc_pages_internal+0xb6/0x41f [] validate_chain+0x395/0xf35 [] default_wake_function+0x0/0x8 [] do_futex+0xbc/0x951 [] __lock_acquire+0x27b/0x932 [] autoremove_wake_function+0x0/0x35 [] dnotify_parent+0x1f/0x5f [] sys_futex+0x86/0x111 [] vfs_read+0xe3/0x11b [] trace_hardirqs_on_thunk+0xc/0x10 [] trace_hardirqs_on_caller+0x9d/0x113 [] sysenter_do_call+0x12/0x35 ======================= mysqld S 00000002 0 3203 3150 f7ac7610 00000096 f7ac7610 00000002 00000000 c082cea4 f7ac7610 c045f0c0 c0461d00 c0461d00 c0461d00 00000046 f7ac7764 c3734d00 f781a300 f64e3e3c 00000282 00000000 08e3db48 c0319cc1 00000046 00000046 f64e3e3c 00000282 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] futex_wait+0x318/0x396 [] generic_file_aio_write+0x55/0xd8 [] generic_file_aio_write+0x73/0xd8 [] default_wake_function+0x0/0x8 [] do_futex+0xbc/0x951 [] autoremove_wake_function+0x0/0x35 [] dnotify_parent+0x1f/0x5f [] sys_futex+0x86/0x111 [] vfs_write+0xe5/0x118 [] sysenter_do_call+0x12/0x35 ======================= mysqld R running 0 3205 3150 f79e9210 00000086 00000002 f7985b14 f7985b08 00000000 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 f7985b18 f79e9364 c3734d00 f781a300 c04b9e00 ffffef43 00000000 00000286 00000286 ffffffff c04b9e00 00000000 00000000 Call Trace: [] schedule_timeout+0x44/0xa1 [] process_timeout+0x0/0x5 [] do_select+0x3ef/0x4cb [] do_select+0x14/0x4cb [] __lock_acquire+0x27b/0x932 [] validate_chain+0x395/0xf35 [] mark_held_locks+0x62/0x73 [] get_page_from_freelist+0x304/0x48d [] trace_hardirqs_on_caller+0x9d/0x113 [] __pollwait+0x0/0xba [] __lock_acquire+0x27b/0x932 [] enqueue_task_fair+0xa2/0xb4 [] try_to_wake_up+0xa8/0x137 [] _spin_lock_irqsave+0x41/0x49 [] del_timer+0x4b/0x53 [] scsi_delete_timer+0xb/0x1b [] queue_work_on+0x34/0x43 [] enqueue_task_fair+0xa2/0xb4 [] core_sys_select+0x1f3/0x335 [] core_sys_select+0x22/0x335 [] _spin_lock_irqsave+0x41/0x49 [] _spin_lock_irqsave+0x41/0x49 [] scsi_next_command+0x25/0x2f [] scsi_end_request+0x52/0x75 [] scsi_io_completion+0x87/0x3de [] _spin_lock+0x31/0x3c [] sys_select+0xd6/0x19e [] sysenter_do_call+0x12/0x35 ======================= mysqld R running 0 3206 3150 f7a1a430 00000086 00000002 f6583b14 f6583b08 00000000 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 f6583b18 f7a1a584 c3734d00 f781a300 c04b9e00 ffffeed6 00000000 00000286 00000286 ffffffff c04b9e00 00000000 00000000 Call Trace: [] schedule_timeout+0x44/0xa1 [] process_timeout+0x0/0x5 [] do_select+0x3ef/0x4cb [] do_select+0x14/0x4cb [] __pollwait+0x0/0xba [] try_to_wake_up+0xa8/0x137 [] autoremove_wake_function+0x14/0x35 [] __wake_up_common+0x46/0x68 [] queue_work_on+0x34/0x43 [] tcp_rcv_established+0x341/0x70b [] tcp_v4_do_rcv+0x94/0x1af [] _spin_unlock+0x14/0x1c [] tcp_v4_rcv+0x498/0x5be [] ip_local_deliver_finish+0x10d/0x15f [] ip_local_deliver_finish+0x3f/0x15f [] nommu_map_single+0x0/0x63 [] core_sys_select+0x1f3/0x335 [] core_sys_select+0x22/0x335 [] e1000_clean_rx_irq+0x301/0x41a [e1000] [] e1000_clean_rx_irq+0x0/0x41a [e1000] [] e1000_clean+0x450/0x5f9 [e1000] [] _spin_lock_irqsave+0x41/0x49 [] scsi_end_request+0x52/0x75 [] net_rx_action+0x85/0x1db [] net_rx_action+0x9d/0x1db [] sys_select+0xd6/0x19e [] sysenter_do_call+0x12/0x35 
======================= mysqld S 00000002 0 3207 3150 f6a61690 00000096 f6a61690 00000002 00000000 c082d348 f6a61690 c045f0c0 c0461d00 c0461d00 c0461d00 00000046 f6a617e4 c3734d00 f781a300 f50d7e3c 00000282 00000000 08af80b0 c0319cc1 00000046 00000046 f50d7e3c 00000282 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] futex_wait+0x318/0x396 [] _spin_lock_irqsave+0x41/0x49 [] futex_wake+0x52/0xe1 [] _spin_unlock+0x14/0x1c [] default_wake_function+0x0/0x8 [] do_futex+0xbc/0x951 [] scsi_next_command+0x25/0x2f [] scsi_io_completion+0x87/0x3de [] _spin_lock+0x31/0x3c [] sys_futex+0x86/0x111 [] sysenter_do_call+0x12/0x35 ======================= mysqld S f5117ecc 0 3208 3150 f79f2a20 00000082 00000002 f5117ecc f5117ec0 00000000 f7073210 c045f0c0 c0461d00 c0461d00 c0461d00 f5117ed0 f79f2b74 c373ed00 f781a300 00000000 ffffec7b 00000000 c373bf30 f79f2a20 ffffffff 00000000 00000000 00000000 Call Trace: [] schedule_timeout+0x69/0xa1 [] dequeue_signal+0x3c/0x12a [] sys_rt_sigtimedwait+0x87/0x285 [] sys_rt_sigtimedwait+0x145/0x285 [] sigprocmask+0x1f/0xc3 [] _spin_unlock_irq+0x20/0x23 [] sigprocmask+0x68/0xc3 [] sys_rt_sigprocmask+0xbf/0xd4 [] sysenter_do_call+0x12/0x35 ======================= mysqld S f651fd94 0 3209 3150 f79a43b0 00000082 00000002 f651fd94 f651fd88 00000000 00000000 c045f0c0 c0461d00 c0461d00 c0461d00 f651fd98 f79a4504 c3734d00 f781a300 00000002 ffffb5a6 00000000 f5dcd100 f5dcd12c ffffffff 00000000 00000000 00000000 Call Trace: [] schedule_timeout+0x69/0xa1 [] sk_wait_data+0x54/0xa1 [] sk_wait_data+0x69/0xa1 [] autoremove_wake_function+0x0/0x35 [] tcp_recvmsg+0x42c/0x7ff [] sock_common_recvmsg+0x3e/0x54 [] sock_aio_read+0xdb/0xef [] do_sync_read+0xd2/0x10e [] autoremove_wake_function+0x0/0x35 [] vfs_read+0x114/0x11b [] sys_read+0x41/0x6a [] sysenter_do_call+0x12/0x35 ======================= mysqld S 00000002 0 3210 3150 f79b84b0 00000096 f79b84b0 00000002 00000000 c082da80 f79b84b0 c045f0c0 c0461d00 c0461d00 c0461d00 00000046 f79b8604 c3734d00 f781a300 f5017e3c 00000282 00000000 08edccf4 c0319cc1 00000046 00000046 f5017e3c 00000282 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] futex_wait+0x318/0x396 [] __generic_file_aio_write_nolock+0x285/0x4df [] generic_file_aio_write+0x55/0xd8 [] generic_file_aio_write+0x73/0xd8 [] futex_wake+0x52/0xe1 [] default_wake_function+0x0/0x8 [] do_futex+0xbc/0x951 [] autoremove_wake_function+0x0/0x35 [] dnotify_parent+0x1f/0x5f [] sys_futex+0x86/0x111 [] sys_write+0x68/0x6a [] sysenter_do_call+0x12/0x35 ======================= mysqld S 00000002 0 4053 3150 f6a49510 00000096 f6a49510 00000002 00000000 c082c6bc f6a49510 c045f0c0 c0461d00 c0461d00 c0461d00 00000046 f6a49664 c3734d00 f781a300 f209fe3c 00000282 00000000 0875bba0 c0319cc1 00000046 00000046 f209fe3c 00000282 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] futex_wait+0x318/0x396 [] default_wake_function+0x0/0x8 [] do_futex+0xbc/0x951 [] autoremove_wake_function+0x0/0x35 [] sys_futex+0x86/0x111 [] sys_read+0x68/0x6a [] sysenter_do_call+0x12/0x35 ======================= logger S 00000000 0 3193 3150 f6ab0230 00000092 f6ab05c8 00000000 f6ab05b0 00000001 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f6ab0384 c3734d00 f7896580 f6ab0230 00000000 00000000 f6ab0230 c0318467 f6ddd768 c013d34e f6ddd764 f6ddd784 Call Trace: [] __mutex_unlock_slowpath+0x89/0xf9 [] trace_hardirqs_on_caller+0x9d/0x113 [] pipe_wait+0x50/0x72 [] autoremove_wake_function+0x0/0x35 [] pipe_read+0xe3/0x3fe [] __lock_acquire+0x27b/0x932 [] do_sync_read+0xd2/0x10e [] autoremove_wake_function+0x0/0x35 [] 
finish_task_switch+0x26/0xab [] schedule+0x228/0x708 [] do_sync_read+0x0/0x10e [] vfs_read+0x85/0x11b [] sys_read+0x41/0x6a [] sysenter_do_call+0x12/0x35 ======================= master R running 0 3362 1 f6aa0130 00000086 00000002 f5173f1c f5173f10 00000000 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 f5173f20 f6aa0284 c3734d00 f6a8d800 c04b9e00 ffffef68 00000000 00000282 00000282 ffffffff c04b9e00 00000000 00000000 Call Trace: [] schedule_timeout+0x44/0xa1 [] process_timeout+0x0/0x5 [] sys_epoll_wait+0x15b/0x500 [] default_wake_function+0x0/0x8 [] sysenter_do_call+0x12/0x35 ======================= mdadm R running 0 3403 1 f79fd310 00000086 00000046 00000000 00000002 00000001 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f79fd464 c3734d00 f7801d00 c04b9e00 00000286 c04b9e00 00000286 00000286 f7499b38 c04b9e00 c0128a5a 00000000 Call Trace: [] __mod_timer+0x94/0xa3 [] schedule_timeout+0x44/0xa1 [] process_timeout+0x0/0x5 [] do_select+0x3ef/0x4cb [] do_select+0x14/0x4cb [] __pollwait+0x0/0xba [] default_wake_function+0x0/0x8 [] validate_chain+0x395/0xf35 [] validate_chain+0x395/0xf35 [] mark_held_locks+0x62/0x73 [] get_page_from_freelist+0x304/0x48d [] trace_hardirqs_on_caller+0x9d/0x113 [] validate_chain+0x395/0xf35 [] validate_chain+0x395/0xf35 [] enqueue_task_fair+0xa2/0xb4 [] _spin_lock_irqsave+0x41/0x49 [] md_wakeup_thread+0x25/0x29 [] md_ioctl+0x10c/0x1082 [] __d_lookup+0xbf/0x14e [] __d_lookup+0xe9/0x14e [] core_sys_select+0x1f3/0x335 [] core_sys_select+0x22/0x335 [] do_open+0xc5/0x2b7 [] blkdev_open+0x0/0x53 [] blkdev_open+0x25/0x53 [] __dentry_open+0x176/0x20a [] nameidata_to_filp+0x35/0x3f [] mutex_lock_nested+0x18c/0x25c [] file_kill+0x13/0x2d [] sys_select+0xd6/0x19e [] filp_close+0x3e/0x62 [] sysenter_do_call+0x12/0x35 ======================= ccsd S f6509b14 0 3461 1 f79fcaa0 00000086 00000002 f6509b14 00000002 00000000 f55ef0ac c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f79fcbf4 c373ed00 f756c300 00000000 00000046 00000000 00000002 00000001 f55ef09c 00000282 f7910280 00000000 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] schedule_timeout+0x69/0xa1 [] unix_poll+0x1a/0xab [] do_select+0x3ef/0x4cb [] do_select+0x14/0x4cb [] __pollwait+0x0/0xba [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] validate_chain+0x395/0xf35 [] validate_chain+0x395/0xf35 [] __lock_acquire+0x27b/0x932 [] sched_clock_cpu+0x10f/0x158 [] update_curr+0x40/0x52 [] try_to_wake_up+0xa8/0x137 [] autoremove_wake_function+0x14/0x35 [] core_sys_select+0x1f3/0x335 [] core_sys_select+0x22/0x335 [] _spin_unlock_irq+0x20/0x23 [] finish_task_switch+0x58/0xab [] finish_task_switch+0x26/0xab [] schedule+0x228/0x708 [] sys_select+0x3f/0x19e [] filp_close+0x3e/0x62 [] sysenter_do_call+0x12/0x35 ======================= ccsd S f5105b14 0 3462 1 f79d0ea0 00000086 00000002 f5105b14 f5105b08 00000000 f55f7aac c045f0c0 c0461d00 c0461d00 c0461d00 f5105b18 f79d0ff4 c3734d00 f756c300 00000000 ffff9486 00000000 00000002 00000001 ffffffff 00000282 00000000 00000000 Call Trace: [] schedule_timeout+0x69/0xa1 [] unix_poll+0x1a/0xab [] do_select+0x3ef/0x4cb [] do_select+0x14/0x4cb [] __lock_acquire+0x27b/0x932 [] __pollwait+0x0/0xba [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] validate_chain+0x395/0xf35 [] check_noncircular+0xd/0xe7 [] validate_chain+0x395/0xf35 [] validate_chain+0x395/0xf35 [] validate_chain+0x395/0xf35 [] __lock_acquire+0x27b/0x932 [] sched_clock_cpu+0x10f/0x158 [] update_curr+0x40/0x52 [] try_to_wake_up+0xa8/0x137 [] 
__wake_up_common+0x46/0x68 [] __lock_acquire+0x27b/0x932 [] core_sys_select+0x1f3/0x335 [] core_sys_select+0x22/0x335 [] validate_chain+0x395/0xf35 [] sock_aio_read+0xdb/0xef [] validate_chain+0x395/0xf35 [] sockfd_lookup_light+0x24/0x41 [] __lock_acquire+0x27b/0x932 [] _spin_unlock_irq+0x20/0x23 [] trace_hardirqs_on_caller+0x9d/0x113 [] _spin_unlock_irq+0x20/0x23 [] finish_task_switch+0x58/0xab [] finish_task_switch+0x26/0xab [] schedule+0x228/0x708 [] dnotify_parent+0x1f/0x5f [] sys_select+0x3f/0x19e [] trace_hardirqs_on_thunk+0xc/0x10 [] trace_hardirqs_on_caller+0x9d/0x113 [] sysenter_do_call+0x12/0x35 ======================= ntpd S 00000000 0 3475 1 f711c5b0 00200082 00200046 00000000 00000002 00000001 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f711c704 c3734d00 f78ffa80 c04b9e00 00200282 c04b9e00 00200282 00200282 f5025c18 c04b9e00 c0128a5a 00000000 Call Trace: [] __mod_timer+0x94/0xa3 [] schedule_timeout+0x44/0xa1 [] process_timeout+0x0/0x5 [] do_sys_poll+0x238/0x311 [] __pollwait+0x0/0xba [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] release_sock+0x24/0xac [] _spin_lock_bh+0x36/0x41 [] release_sock+0x24/0xac [] _spin_lock_bh+0x36/0x41 [] udpv6_recvmsg+0x1ac/0x335 [ipv6] [] local_bh_enable_ip+0x79/0xb5 [] udpv6_recvmsg+0x1ac/0x335 [ipv6] [] enqueue_task_fair+0xa2/0xb4 [] _spin_lock_irqsave+0x41/0x49 [] _spin_lock_irqsave+0x41/0x49 [] scsi_next_command+0x25/0x2f [] scsi_end_request+0x52/0x75 [] scsi_io_completion+0x87/0x3de [] _spin_lock+0x31/0x3c [] dnotify_parent+0x1f/0x5f [] vfs_write+0xe5/0x118 [] sys_poll+0x2d/0x71 [] sysenter_do_call+0x12/0x35 ======================= ntpd S 00000000 0 3476 1 f6a708a0 00200082 00200046 00000000 00000002 00000001 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f6a709f4 c3734d00 f64d7800 c04b9e00 00200282 c04b9e00 00200282 00200282 f7457c18 c04b9e00 c0128a5a 00000000 Call Trace: [] __mod_timer+0x94/0xa3 [] schedule_timeout+0x44/0xa1 [] process_timeout+0x0/0x5 [] do_sys_poll+0x238/0x311 [] __pollwait+0x0/0xba [] default_wake_function+0x0/0x8 [] do_wp_page+0x2a2/0x537 [] handle_mm_fault+0x1b3/0x8a4 [] _spin_lock+0x31/0x3c [] handle_mm_fault+0x4ea/0x8a4 [] task_rq_lock+0x2e/0x55 [] _spin_lock+0x31/0x3c [] try_to_wake_up+0xa8/0x137 [] __wake_up_common+0x46/0x68 [] _read_unlock+0x14/0x1c [] unix_stream_sendmsg+0x1c3/0x32b [] sock_aio_write+0xcb/0xe8 [] find_lock_page+0x25/0x55 [] sock_aio_write+0x0/0xe8 [] do_sync_readv_writev+0xcb/0x107 [] __do_fault+0x192/0x3f0 [] dnotify_parent+0x1f/0x5f [] do_readv_writev+0x118/0x179 [] sock_aio_write+0x0/0xe8 [] up_read+0x14/0x2a [] do_page_fault+0x156/0x848 [] vfs_writev+0x3c/0x50 [] sys_poll+0x2d/0x71 [] sysenter_do_call+0x12/0x35 ======================= sshd S 00000000 0 3492 1 f79da030 00200086 00000042 00000000 00000042 00000002 00000000 c045f0c0 c0461d00 c0461d00 c0461d00 00000032 f79da184 c373ed00 f756c580 f791b980 00000020 00200046 00000000 00000002 00000001 f5afd59c 00200282 f791b980 Call Trace: [] schedule_timeout+0x69/0xa1 [] tcp_poll+0x17/0x11e [] do_select+0x3ef/0x4cb [] do_select+0x14/0x4cb [] ndisc_recv_ns+0x226/0x563 [ipv6] [] __pollwait+0x0/0xba [] default_wake_function+0x0/0x8 [] __lock_acquire+0x27b/0x932 [] validate_chain+0x395/0xf35 [] validate_chain+0x395/0xf35 [] validate_chain+0x395/0xf35 [] validate_chain+0x395/0xf35 [] validate_chain+0x395/0xf35 [] mark_held_locks+0x62/0x73 [] get_page_from_freelist+0x304/0x48d [] trace_hardirqs_on_caller+0x9d/0x113 [] __lock_acquire+0x27b/0x932 [] mark_held_locks+0x62/0x73 [] _spin_unlock_irq+0x20/0x23 [] 
trace_hardirqs_on_caller+0x9d/0x113 [] _spin_unlock_irq+0x20/0x23 [] add_to_page_cache_locked+0x70/0x8f [] kmap_atomic+0x1c/0x21 [] shmem_getpage+0x4e9/0x661 [] kmap_atomic+0x1c/0x21 [] __lock_acquire+0x27b/0x932 [] __lock_acquire+0x27b/0x932 [] core_sys_select+0x1f3/0x335 [] core_sys_select+0x22/0x335 [] __do_fault+0x2bc/0x3f0 [] __do_fault+0x192/0x3f0 [] trace_hardirqs_on_caller+0x9d/0x113 [] handle_mm_fault+0x176/0x8a4 [] do_page_fault+0xa5/0x848 [] sys_select+0x3f/0x19e [] trace_hardirqs_on_thunk+0xc/0x10 [] do_page_fault+0x0/0x848 [] trace_hardirqs_on_caller+0x9d/0x113 [] sysenter_do_call+0x12/0x35 ======================= pickup S f6581f1c 0 3503 3362 f79e40b0 00200086 00000002 f6581f1c f6581f10 00000000 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 f6581f20 f79e4204 c3734d00 f78ffd00 c04b9e00 ffffd9c4 00000000 00200282 00200282 ffffffff c04b9e00 00000000 00000000 Call Trace: [] schedule_timeout+0x44/0xa1 [] process_timeout+0x0/0x5 [] sys_epoll_wait+0x15b/0x500 [] default_wake_function+0x0/0x8 [] sysenter_do_call+0x12/0x35 ======================= qmgr R running 0 3504 3362 f71d1310 00200086 00000002 f513ff1c f513ff10 00000000 f706e000 c045f0c0 c0461d00 c0461d00 c0461d00 f513ff20 f71d1464 c373ed00 f781a580 f706e000 ffffef69 00000000 00200282 00200282 ffffffff f706e000 00000000 00000000 Call Trace: [] schedule_timeout+0x44/0xa1 [] process_timeout+0x0/0x5 [] sys_epoll_wait+0x15b/0x500 [] vfs_ioctl+0x1f/0x6d [] default_wake_function+0x0/0x8 [] sysenter_do_call+0x12/0x35 ======================= aisexec R running 0 3510 1 f79a4c20 00000082 00000400 c373baa0 00000002 f645fbf0 f706e000 c045f0c0 c0461d00 c0461d00 c0461d00 f645fbf8 f79a4d74 c373ed00 f64a8a80 f706e000 ffffef78 00000000 00000282 00000282 ffffffff f706e000 00000000 00000000 Call Trace: [] schedule_timeout+0x44/0xa1 [] process_timeout+0x0/0x5 [] do_sys_poll+0x238/0x311 [] __pollwait+0x0/0xba [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] read_tsc+0x6/0x28 [] getnstimeofday+0x44/0x100 [] copy_to_user+0x2f/0x4d [] sys_poll+0x2d/0x71 [] sysenter_do_call+0x12/0x35 ======================= aisexec S f750fbf4 0 3511 1 f7aa4c20 00000082 00000002 f750fbf4 f750fbe8 00000000 f7aa4c20 c045f0c0 c0461d00 c0461d00 c0461d00 f750fbf8 f7aa4d74 c373ed00 f64a8a80 00000080 ffff93a4 00000000 00000002 c0555fa0 ffffffff 00000000 00000000 00000000 Call Trace: [] schedule_timeout+0x69/0xa1 [] do_sys_poll+0x238/0x311 [] __pollwait+0x0/0xba [] mark_held_locks+0x62/0x73 [] get_page_from_freelist+0x304/0x48d [] trace_hardirqs_on_caller+0x9d/0x113 [] __lock_acquire+0x27b/0x932 [] validate_chain+0x395/0xf35 [] mark_held_locks+0x62/0x73 [] validate_chain+0x395/0xf35 [] __lock_acquire+0x27b/0x932 [] find_usage_backwards+0xc/0xce [] find_usage_backwards+0xc/0xce [] find_usage_backwards+0xc/0xce [] __lock_acquire+0x27b/0x932 [] validate_chain+0x395/0xf35 [] __lock_acquire+0x27b/0x932 [] __do_fault+0x2bc/0x3f0 [] __do_fault+0x192/0x3f0 [] 
handle_mm_fault+0x176/0x8a4 [] do_page_fault+0xa5/0x848 [] up_read+0x14/0x2a [] do_page_fault+0x156/0x848 [] trace_hardirqs_on_thunk+0xc/0x10 [] do_page_fault+0x0/0x848 [] sys_poll+0x2d/0x71 [] sysenter_do_call+0x12/0x35 ======================= aisexec S f4835df8 0 3513 1 f6a51590 00000096 00000002 f4835df8 f4835dec 00000000 f6a51590 c045f0c0 c0461d00 c0461d00 c0461d00 f4835dfc f6a516e4 c373ed00 f64a8a80 f4835e3c ffff9468 00000000 093b44c0 c0319cc1 ffffffff c0319d8f 00000000 00000000 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] _spin_unlock_irqrestore+0x34/0x39 [] futex_wait+0x318/0x396 [] __generic_file_aio_write_nolock+0x285/0x4df [] generic_file_aio_write+0x73/0xd8 [] futex_wake+0x52/0xe1 [] validate_chain+0x395/0xf35 [] default_wake_function+0x0/0x8 [] do_futex+0xbc/0x951 [] __lock_acquire+0x27b/0x932 [] autoremove_wake_function+0x0/0x35 [] dnotify_parent+0x1f/0x5f [] sys_futex+0x86/0x111 [] trace_hardirqs_on_thunk+0xc/0x10 [] trace_hardirqs_on_caller+0x9d/0x113 [] sysenter_do_call+0x12/0x35 ======================= aisexec S f510c998 0 3529 1 f79b9590 00000086 00000246 f510c998 00000000 f510e000 00000246 c045f0c0 c0461d00 c0461d00 c0461d00 c01d720f f79b96e4 c373ed00 f64a8a80 00000000 f510c988 00000000 00000000 f510e000 c01d720f 00000246 00000246 00000246 Call Trace: [] ipc_lock+0x60/0xb8 [] ipc_lock+0x60/0xb8 [] sys_semtimedop+0x5a8/0x7a4 [] schedule+0x228/0x708 [] futex_wait+0xb6/0x396 [] _spin_unlock+0x14/0x1c [] futex_wait+0xeb/0x396 [] futex_wake+0x52/0xe1 [] _spin_unlock+0x14/0x1c [] futex_wake+0xc0/0xe1 [] do_futex+0x7a/0x951 [] __lock_acquire+0x27b/0x932 [] sys_ipc+0x6a/0x293 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= aisexec S f510c898 0 3541 1 f6a71110 00000086 00000246 f510c898 00000000 f5106000 00000246 c045f0c0 c0461d00 c0461d00 c0461d00 c01d720f f6a71264 c3734d00 f64a8a80 00000000 f510c888 00000001 00000000 f5106000 c01d720f 00000246 00000246 00000246 Call Trace: [] ipc_lock+0x60/0xb8 [] ipc_lock+0x60/0xb8 [] sys_semtimedop+0x5a8/0x7a4 [] copy_from_user+0x2d/0x59 [] copy_from_user+0x2d/0x59 [] verify_iovec+0x2a/0x8a [] sys_sendmsg+0x232/0x237 [] sys_socketcall+0xb7/0x2a9 [] sys_ipc+0x6a/0x293 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= aisexec S f5027d7c 0 3542 1 f7998330 00000086 00000002 f5027d7c f5027d70 00000000 00000246 c045f0c0 c0461d00 c0461d00 c0461d00 f5027d80 f7998484 c373ed00 f64a8a80 00000000 ffff985c 00000000 00000000 f5026000 ffffffff 00000246 00000000 00000000 Call Trace: [] sys_semtimedop+0x5a8/0x7a4 [] sys_ipc+0x6a/0x293 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= aisexec S f510c798 0 3556 1 f7a03390 00000086 00000246 f510c798 00000000 f48ac000 00000246 c045f0c0 c0461d00 c0461d00 c0461d00 c01d720f f7a034e4 c373ed00 f64a8a80 00000000 f510c788 00000003 00000000 f48ac000 c01d720f 00000246 00000246 00000246 Call Trace: [] ipc_lock+0x60/0xb8 [] ipc_lock+0x60/0xb8 [] sys_semtimedop+0x5a8/0x7a4 [] copy_from_user+0x2d/0x59 [] copy_from_user+0x2d/0x59 [] verify_iovec+0x2a/0x8a [] sys_sendmsg+0x232/0x237 [] sys_socketcall+0xb7/0x2a9 [] sys_ipc+0x6a/0x293 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= aisexec S f4884a98 0 3653 1 f6a60e20 00000086 00000246 f4884a98 00000000 f48d6000 00000246 c045f0c0 c0461d00 c0461d00 c0461d00 c01d720f f6a60f74 c3734d00 f64a8a80 00000000 f4884a88 00000004 00000000 f48d6000 c01d720f 00000246 00000246 00000246 Call Trace: [] ipc_lock+0x60/0xb8 [] 
ipc_lock+0x60/0xb8 [] sys_semtimedop+0x5a8/0x7a4 [] copy_from_user+0x2d/0x59 [] copy_from_user+0x2d/0x59 [] verify_iovec+0x2a/0x8a [] sys_sendmsg+0x232/0x237 [] sys_socketcall+0xb7/0x2a9 [] sys_ipc+0x6a/0x293 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= aisexec S f75a5d98 0 3686 1 f79c2530 00000086 00000246 f75a5d98 00000000 f4878000 00000246 c045f0c0 c0461d00 c0461d00 c0461d00 c01d720f f79c2684 c373ed00 f64a8a80 00000000 f75a5d88 00000005 00000000 f4878000 c01d720f 00000246 00000246 00000246 Call Trace: [] ipc_lock+0x60/0xb8 [] ipc_lock+0x60/0xb8 [] sys_semtimedop+0x5a8/0x7a4 [] copy_from_user+0x2d/0x59 [] copy_from_user+0x2d/0x59 [] verify_iovec+0x2a/0x8a [] sys_sendmsg+0x232/0x237 [] __dequeue_entity+0x45/0xaa [] update_curr+0x40/0x52 [] set_next_entity+0x14/0x38 [] pick_next_task_fair+0x8f/0xa8 [] sys_socketcall+0xb7/0x2a9 [] sys_ipc+0x6a/0x293 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= aisexec S f75a5e98 0 3697 1 f79d1710 00000086 00000246 f75a5e98 00000000 f4898000 00000246 c045f0c0 c0461d00 c0461d00 c0461d00 c01d720f f79d1864 c373ed00 f64a8a80 00000000 f75a5e88 00000006 00000000 f4898000 c01d720f 00000246 00000246 00000246 Call Trace: [] ipc_lock+0x60/0xb8 [] ipc_lock+0x60/0xb8 [] sys_semtimedop+0x5a8/0x7a4 [] copy_from_user+0x2d/0x59 [] copy_from_user+0x2d/0x59 [] verify_iovec+0x2a/0x8a [] sys_sendmsg+0x232/0x237 [] sys_socketcall+0xb7/0x2a9 [] sys_ipc+0x6a/0x293 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= aisexec S f64f4998 0 3710 1 f7155710 00000086 00000246 f64f4998 00000000 f50f4000 00000246 c045f0c0 c0461d00 c0461d00 c0461d00 c01d720f f7155864 c373ed00 f64a8a80 00000000 f64f4988 00000007 00000000 f50f4000 c01d720f 00000246 00000246 00000246 Call Trace: [] ipc_lock+0x60/0xb8 [] ipc_lock+0x60/0xb8 [] sys_semtimedop+0x5a8/0x7a4 [] try_to_wake_up+0xa8/0x137 [] __wake_up_common+0x46/0x68 [] __dequeue_entity+0x45/0xaa [] set_next_entity+0x14/0x38 [] pick_next_task_fair+0x8f/0xa8 [] sys_ipc+0x6a/0x293 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= aisexec S f64f4a98 0 3714 1 f711b590 00000086 00000246 f64f4a98 00000000 f75d6000 00000246 c045f0c0 c0461d00 c0461d00 c0461d00 c01d720f f711b6e4 c3734d00 f64a8a80 00000000 f64f4a88 00000008 00000000 f75d6000 c01d720f 00000246 00000246 00000246 Call Trace: [] ipc_lock+0x60/0xb8 [] ipc_lock+0x60/0xb8 [] sys_semtimedop+0x5a8/0x7a4 [] sys_ipc+0x6a/0x293 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= groupd S f6465bf4 0 3527 1 f6a59610 00000082 00000002 f6465bf4 00000002 00000000 f7abc4b0 c045f0c0 c0461d00 c0461d00 c0461d00 f6465bf8 f6a59764 c373ed00 f7963580 f6465efc 00000046 00000000 00000002 00000001 f4ca931c 00000286 f78c2080 f6465efc Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] schedule_timeout+0x69/0xa1 [] unix_poll+0x1a/0xab [] do_sys_poll+0x238/0x311 [] __pollwait+0x0/0xba [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] try_to_wake_up+0xa8/0x137 [] __wake_up_common+0x46/0x68 [] _read_unlock+0x14/0x1c [] unix_stream_sendmsg+0x1c3/0x32b [] sock_aio_write+0xcb/0xe8 [] 
sockfd_lookup_light+0x24/0x41 [] autoremove_wake_function+0x0/0x35 [] dnotify_parent+0x1f/0x5f [] vfs_write+0xe5/0x118 [] sys_poll+0x2d/0x71 [] sysenter_do_call+0x12/0x35 ======================= fenced S f5177bf4 0 3532 1 f6a48430 00000082 00000002 f5177bf4 f5177be8 00000000 f55f982c c045f0c0 c0461d00 c0461d00 c0461d00 f5177bf8 f6a48584 c373ed00 f78a5800 f5177ebc ffff95a0 00000000 00000002 00000001 ffffffff 00000286 00000000 00000000 Call Trace: [] schedule_timeout+0x69/0xa1 [] unix_poll+0x1a/0xab [] do_sys_poll+0x238/0x311 [] __pollwait+0x0/0xba [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] sched_clock_cpu+0x10f/0x158 [] update_curr+0x40/0x52 [] try_to_wake_up+0xa8/0x137 [] __wake_up_common+0x46/0x68 [] unix_stream_recvmsg+0x3a0/0x4d8 [] sock_aio_read+0xdb/0xef [] sockfd_lookup_light+0x24/0x41 [] skb_dequeue+0x50/0x56 [] dput+0x97/0x145 [] mntput_no_expire+0x18/0xef [] filp_close+0x3e/0x62 [] sys_poll+0x2d/0x71 [] sysenter_do_call+0x12/0x35 ======================= dlm_controld S c373baa0 0 3536 1 f7abc4b0 00000082 000007fb c373baa0 00000002 00000000 f55faaac c045f0c0 c0461d00 c0461d00 c0461d00 f4837bf8 f7abc604 c373ed00 f756c800 f4837ebc 00000046 00000000 00000002 00000001 f55faa9c 00000286 f79de080 f4837ebc Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] schedule_timeout+0x69/0xa1 [] unix_poll+0x1a/0xab [] do_sys_poll+0x238/0x311 [] __pollwait+0x0/0xba [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] _spin_lock_irqsave+0x41/0x49 [] netlink_recvmsg+0x1d1/0x294 [] __wake_up_common+0x46/0x68 [] sock_recvmsg+0xcf/0xf3 [] autoremove_wake_function+0x0/0x35 [] sock_aio_read+0xdb/0xef [] sockfd_lookup_light+0x24/0x41 [] do_sync_read+0xd2/0x10e [] autoremove_wake_function+0x0/0x35 [] sys_recv+0x37/0x3b [] sys_socketcall+0x1a9/0x2a9 [] sys_poll+0x2d/0x71 [] sysenter_do_call+0x12/0x35 ======================= gfs_controld S f75b1bf4 0 3540 1 f7ac6da0 00000082 00000002 f75b1bf4 f75b1be8 00000000 00000000 c045f0c0 c0461d00 c0461d00 c0461d00 f75b1bf8 f7ac6ef4 c3734d00 f78a5080 f7ac6da0 ffffef62 00000000 f7b98a80 f75b1ed4 ffffffff f8939698 00000000 00000000 Call Trace: [] schedule_timeout+0x69/0xa1 [] _spin_unlock+0x14/0x1c [] do_sys_poll+0x238/0x311 [] __pollwait+0x0/0xba [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] ipc_lock+0x60/0xb8 [] _spin_lock+0x31/0x3c [] sys_semtimedop+0x2bf/0x7a4 [] autoremove_wake_function+0x0/0x35 [] _spin_lock_irq+0x37/0x42 [] futex_wake+0x52/0xe1 [] do_futex+0x7a/0x951 [] net_rx_action+0x9d/0x1db [] do_page_fault+0xa5/0x848 [] sys_futex+0x86/0x111 [] copy_to_user+0x2f/0x4d [] sys_poll+0x2d/0x71 [] sysenter_do_call+0x12/0x35 ======================= clvmd S f50fbb14 0 3644 1 f71c81b0 00000086 00000002 f50fbb14 f50fbb08 00000000 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 f50fbb18 f71c8304 c3734d00 f64a8300 c04b9e00 ffffde5e 00000000 00000286 00000286 ffffffff c04b9e00 00000000 00000000 Call Trace: [] schedule_timeout+0x44/0xa1 [] process_timeout+0x0/0x5 [] do_select+0x3ef/0x4cb [] do_select+0x14/0x4cb [] __pollwait+0x0/0xba [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] _spin_unlock_irq+0x20/0x23 [] finish_task_switch+0x58/0xab [] finish_task_switch+0x26/0xab [] 
schedule+0x228/0x708 [] sched_clock_cpu+0x10f/0x158 [] update_curr+0x40/0x52 [] try_to_wake_up+0xa8/0x137 [] __wake_up_common+0x46/0x68 [] unix_stream_recvmsg+0x3a0/0x4d8 [] core_sys_select+0x1f3/0x335 [] core_sys_select+0x22/0x335 [] sock_aio_read+0xdb/0xef [] sockfd_lookup_light+0x24/0x41 [] sys_recvfrom+0x11f/0x121 [] _spin_unlock_irq+0x20/0x23 [] finish_task_switch+0x58/0xab [] finish_task_switch+0x26/0xab [] schedule+0x228/0x708 [] dnotify_parent+0x1f/0x5f [] sys_select+0xd6/0x19e [] sysenter_do_call+0x12/0x35 ======================= clvmd S 00000002 0 3655 1 f7aa5490 00000086 f7aa5490 00000002 00000000 f75e02a0 f7aa5490 c045f0c0 c0461d00 c0461d00 c0461d00 00000046 f7aa55e4 c373ed00 f64a8300 f75e02d8 00000282 f75e02d8 b7dc3368 c0319cc1 00000046 00000246 00000246 f75e0290 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] device_read+0x12d/0x347 [dlm] [] default_wake_function+0x0/0x8 [] device_read+0x0/0x347 [dlm] [] vfs_read+0x85/0x11b [] sys_read+0x41/0x6a [] sysenter_do_call+0x12/0x35 ======================= clvmd S 00000002 0 3656 1 f7ab3510 00000096 f7ab3510 00000002 00000000 c082d818 f7ab3510 c045f0c0 c0461d00 c0461d00 c0461d00 00000046 f7ab3664 c3734d00 f64a8300 f4913e3c 00000282 00000000 080b1fa4 c0319cc1 00000046 00000046 f4913e3c 00000282 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] futex_wait+0x318/0x396 [] futex_wake+0x52/0xe1 [] _spin_unlock+0x14/0x1c [] default_wake_function+0x0/0x8 [] do_futex+0xbc/0x951 [] sys_futex+0x86/0x111 [] up_write+0x14/0x28 [] sysenter_do_call+0x12/0x35 ======================= dlm_astd S 00000002 0 3645 2 f7aa43b0 00000092 00000246 00000002 00000000 f8939428 f7aa43b0 c045f0c0 c0461d00 c0461d00 c0461d00 00000246 f7aa4504 c3734d00 f7801800 00000246 00000000 00000002 00000046 00000046 f8939440 00000000 f8939440 f8939460 Call Trace: [] gdlm_bast+0x0/0xa4 [gfs] [] dlm_astd+0x62/0x155 [dlm] [] dlm_astd+0x0/0x155 [dlm] [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= dlm_scand R running 0 3646 2 f7abcd20 00000092 00000002 f50e7f88 f50e7f7c 00000000 f706e000 c045f0c0 c0461d00 c0461d00 c0461d00 f50e7f8c f7abce74 c373ed00 f64a8a80 f706e000 ffffef4b 00000000 00000282 00000282 ffffffff f706e000 00000000 00000000 Call Trace: [] dlm_scand+0x0/0x77 [dlm] [] schedule_timeout+0x44/0xa1 [] process_timeout+0x0/0x5 [] dlm_scand+0x69/0x77 [dlm] [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= dlm_recv/0 R running 0 3647 2 f78e5690 00000082 00000046 00000000 00000002 00000001 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f78e57e4 c3734d00 f7801800 00000286 00000001 f75e071c c0319cc1 00000000 00000046 00000046 f75e0724 00000286 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= dlm_recv/1 S 00000000 0 3648 2 f6a58da0 00000082 00000046 00000000 00000002 00000001 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f6a58ef4 c373ed00 f756cd00 00000286 00000001 f75e069c c0319cc1 00000000 00000046 00000046 f75e06a4 00000286 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= dlm_send S f4823f94 0 3649 2 f6ab0aa0 00000082 00000002 f4823f94 f4823f88 00000000 00000046 
c045f0c0 c0461d00 c0461d00 c0461d00 f4823f98 f6ab0bf4 c3734d00 f7801800 00000286 ffffef69 00000000 c0319cc1 00000000 ffffffff 00000046 00000000 00000000 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= dlm_recoverd S f5dc1f80 0 3650 2 f79f3290 00000086 00000002 f5dc1f80 f5dc1f74 00000000 00002bd4 c045f0c0 c0461d00 c0461d00 c0461d00 f5dc1f84 f79f33e4 c3734d00 f64a8a80 00000002 ffff97ae 00000000 f79f3290 00000046 ffffffff f7001280 00000000 00000000 Call Trace: [] dlm_recoverd+0x163/0x55a [dlm] [] __wake_up_common+0x46/0x68 [] dlm_recoverd+0x0/0x55a [dlm] [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= kdmflush S 00000000 0 3663 2 f6ab82b0 00000082 00000046 00000000 00000002 00000001 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f6ab8404 c373ed00 f7c60d00 00000286 00000001 f482bd9c c0319cc1 00000000 00000046 00000046 f482bda4 00000286 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= kdmflush S f6047f94 0 3669 2 f71d0230 00000082 00000002 f6047f94 f6047f88 00000000 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 f6047f98 f71d0384 c373ed00 f756c800 00000286 ffff980b 00000000 c0319cc1 00000000 ffffffff 00000046 00000000 00000000 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] worker_thread+0x0/0x89 [] worker_thread+0x7a/0x89 [] autoremove_wake_function+0x0/0x35 [] worker_thread+0x0/0x89 [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= lock_dlm1 S f5195f50 0 3690 2 f79c2da0 00000086 00000002 f5195f50 f5195f44 00000000 f79c2da0 c045f0c0 c0461d00 c0461d00 c0461d00 f5195f54 f79c2ef4 c373ed00 f7896300 00000246 ffffa9f2 00000000 00000000 00000000 ffffffff 00000246 00000000 00000000 Call Trace: [] gdlm_thread+0x112/0x74a [gfs] [] autoremove_wake_function+0x0/0x35 [] gdlm_thread1+0x0/0xa [gfs] [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= lock_dlm2 S 00000002 0 3694 2 f7a022b0 00000086 c040dc00 00000002 00000000 f78d2b64 f7a022b0 c045f0c0 c0461d00 c0461d00 c0461d00 00000246 f7a02404 c3734d00 f78a5580 00000246 00000000 00000002 00000000 00000000 f8a45d59 00000246 00000246 f78d2b54 Call Trace: [] gdlm_thread+0x12d/0x74a [gfs] [] gdlm_thread+0x112/0x74a [gfs] [] autoremove_wake_function+0x0/0x35 [] gdlm_thread2+0x0/0x7 [gfs] [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= dlm_recoverd S f8939428 0 3695 2 f7a428a0 00000086 00000000 f8939428 f7a428a0 f5052bf4 00002bd4 c045f0c0 c0461d00 c0461d00 c0461d00 f5181800 f7a429f4 c373ed00 f64a8a80 00000002 00000000 00000000 f7a428a0 00000046 00000000 f7001280 f704b000 f7001280 Call Trace: [] dlm_recoverd+0x163/0x55a [dlm] [] __wake_up_common+0x46/0x68 [] dlm_recoverd+0x0/0x55a [dlm] [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= gfs_scand R running 0 3701 2 f7154ea0 00000086 00000046 00000000 00000002 00000001 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f7154ff4 c3734d00 f78a5080 c04b9e00 00000286 c04b9e00 00000286 00000286 f4827fa4 c04b9e00 c0128a5a 00000000 Call Trace: [] __mod_timer+0x94/0xa3 
[] schedule_timeout+0x44/0xa1 [] gfs_scand+0x1b/0x48 [gfs] [] process_timeout+0x0/0x5 [] gfs_scand+0x39/0x48 [gfs] [] gfs_scand+0x0/0x48 [gfs] [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= gfs_glockd S f4897f94 0 3702 2 f7abd590 00000082 00000002 f4897f94 f4897f88 00000000 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 f4897f98 f7abd6e4 c373ed00 f64a8a80 00000286 ffffef4b 00000000 c0319cc1 00000000 ffffffff 00000046 00000000 00000000 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] gfs_glockd+0x80/0xb4 [gfs] [] autoremove_wake_function+0x0/0x35 [] gfs_glockd+0x0/0xb4 [gfs] [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= gfs_recoverd S f5161f80 0 3704 2 f7a17490 00000086 00000002 f5161f80 f5161f74 00000000 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 f5161f84 f7a175e4 c3734d00 f78a5580 c04b9e00 ffffde92 00000000 00000286 00000286 ffffffff c04b9e00 00000000 00000000 Call Trace: [] schedule_timeout+0x44/0xa1 [] gfs_recoverd+0x1b/0x48 [gfs] [] process_timeout+0x0/0x5 [] gfs_recoverd+0x39/0x48 [gfs] [] gfs_recoverd+0x0/0x48 [gfs] [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= gfs_logd R running 0 3705 2 f7154630 00000082 00000400 c373baa0 00000002 f4893f2c f706e000 c045f0c0 c0461d00 c0461d00 c0461d00 f4893f34 f7154784 c373ed00 f781a580 f706e000 ffffef77 00000000 00000286 00000286 ffffffff f706e000 00000000 00000000 Call Trace: [] schedule_timeout+0x44/0xa1 [] gfs_logd+0x23/0xb7 [gfs] [] process_timeout+0x0/0x5 [] gfs_logd+0x41/0xb7 [gfs] [] gfs_logd+0x0/0xb7 [gfs] [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= gfs_quotad R running 0 3706 2 f71e0ba0 00000086 00000002 f6039f64 f6039f58 00000000 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 f6039f68 f71e0cf4 c3734d00 f7c60300 c04b9e00 ffffee49 00000000 00000282 00000282 ffffffff c04b9e00 00000000 00000000 Call Trace: [] schedule_timeout+0x44/0xa1 [] gfs_quotad+0x88/0x16d [gfs] [] process_timeout+0x0/0x5 [] gfs_quotad+0xa6/0x16d [gfs] [] gfs_quotad+0x0/0x16d [gfs] [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= gfs_inoded R running 0 3707 2 f71e0330 00000086 00000002 f6151f80 f6151f74 00000000 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 f6151f84 f71e0484 c3734d00 f78a5580 c04b9e00 ffffea63 00000000 00000286 00000286 ffffffff c04b9e00 00000000 00000000 Call Trace: [] schedule_timeout+0x44/0xa1 [] gfs_inoded+0x1b/0x48 [gfs] [] process_timeout+0x0/0x5 [] gfs_inoded+0x39/0x48 [gfs] [] gfs_inoded+0x0/0x48 [gfs] [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= lock_dlm1 S 00000002 0 3711 2 f71c8a20 00000086 00000000 00000002 00000000 f71bb764 f71c8a20 c045f0c0 c0461d00 c0461d00 c0461d00 00000246 f71c8b74 c3734d00 f7801800 00000246 00000000 00000002 00000000 00000000 f8a45d59 00000246 00000246 f71bb754 Call Trace: [] gdlm_thread+0x12d/0x74a [gfs] [] gdlm_thread+0x112/0x74a [gfs] [] autoremove_wake_function+0x0/0x35 [] gdlm_thread1+0x0/0xa [gfs] [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= lock_dlm2 S f6157f50 0 3712 2 f6a68ea0 00000086 00000002 f6157f50 f6157f44 00000000 f6a68ea0 c045f0c0 c0461d00 c0461d00 c0461d00 f6157f54 f6a68ff4 c373ed00 f781a580 00000246 ffffef73 00000000 00000000 00000000 ffffffff 00000246 00000000 00000000 Call Trace: [] gdlm_thread+0x112/0x74a [gfs] [] autoremove_wake_function+0x0/0x35 
[] gdlm_thread2+0x0/0x7 [gfs] [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= dlm_recoverd S f8939428 0 3713 2 f6a69710 00000086 00000000 f8939428 f6a69710 f518ebf4 00002bd4 c045f0c0 c0461d00 c0461d00 c0461d00 f64b5800 f6a69864 c373ed00 f7801580 00000002 00000000 00000000 f6a69710 00000046 00000000 f7001280 f704b000 f7001280 Call Trace: [] dlm_recoverd+0x163/0x55a [dlm] [] __wake_up_common+0x46/0x68 [] dlm_recoverd+0x0/0x55a [dlm] [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= gfs_scand R running 0 3719 2 f71580b0 00000086 00000046 00000000 00000002 00000001 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f7158204 c3734d00 f78a5a80 c04b9e00 00000286 c04b9e00 00000286 00000286 f74b5fa4 c04b9e00 c0128a5a 00000000 Call Trace: [] __mod_timer+0x94/0xa3 [] schedule_timeout+0x44/0xa1 [] gfs_scand+0x1b/0x48 [gfs] [] process_timeout+0x0/0x5 [] gfs_scand+0x39/0x48 [gfs] [] gfs_scand+0x0/0x48 [gfs] [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= gfs_glockd S f219112c 0 3720 2 f6ab1310 00000082 f2191158 f219112c 00000246 f8b44994 00000046 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f6ab1464 c3734d00 f7801800 00000286 00000001 f74b7fbc c0319cc1 00000000 00000046 00000046 f8b83504 00000286 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] gfs_glockd+0x80/0xb4 [gfs] [] autoremove_wake_function+0x0/0x35 [] gfs_glockd+0x0/0xb4 [gfs] [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= gfs_recoverd S f4b9bf80 0 3722 2 f79b0ca0 00000086 00000002 f4b9bf80 f4b9bf74 00000000 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 f4b9bf84 f79b0df4 c3734d00 f64a8a80 c04b9e00 ffffdeeb 00000000 00000286 00000286 ffffffff c04b9e00 00000000 00000000 Call Trace: [] schedule_timeout+0x44/0xa1 [] gfs_recoverd+0x1b/0x48 [gfs] [] process_timeout+0x0/0x5 [] gfs_recoverd+0x39/0x48 [gfs] [] gfs_recoverd+0x0/0x48 [gfs] [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= gfs_logd R running 0 3723 2 f79b0430 00000082 00000002 f4b9df30 f4b9df24 00000000 f706e000 c045f0c0 c0461d00 c0461d00 c0461d00 f4b9df34 f79b0584 c373ed00 f64a8a80 f706e000 ffffef4b 00000000 00000286 00000286 ffffffff f706e000 00000000 00000000 Call Trace: [] schedule_timeout+0x44/0xa1 [] gfs_logd+0x23/0xb7 [gfs] [] process_timeout+0x0/0x5 [] gfs_logd+0x41/0xb7 [gfs] [] gfs_logd+0x0/0xb7 [gfs] [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= gfs_quotad R running 0 3724 2 f6a58530 00000086 00000002 f4b9ff64 f4b9ff58 00000000 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 f4b9ff68 f6a58684 c3734d00 f7c60300 c04b9e00 ffffee87 00000000 00000282 00000282 ffffffff c04b9e00 00000000 00000000 Call Trace: [] schedule_timeout+0x44/0xa1 [] gfs_quotad+0x88/0x16d [gfs] [] process_timeout+0x0/0x5 [] gfs_quotad+0xa6/0x16d [gfs] [] gfs_quotad+0x0/0x16d [gfs] [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] kernel_thread_helper+0x7/0x1c ======================= gfs_inoded R running 0 3725 2 f7a749a0 00000086 00000002 f4ba1f80 f4ba1f74 00000000 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 f4ba1f84 f7a74af4 c3734d00 f781a300 c04b9e00 ffffeab3 00000000 00000286 00000286 ffffffff c04b9e00 00000000 00000000 Call Trace: [] schedule_timeout+0x44/0xa1 [] gfs_inoded+0x1b/0x48 [gfs] [] process_timeout+0x0/0x5 [] gfs_inoded+0x39/0x48 [gfs] [] gfs_inoded+0x0/0x48 [gfs] [] kthread+0x34/0x55 [] kthread+0x0/0x55 [] 
kernel_thread_helper+0x7/0x1c ======================= lighttpd R running 0 3859 1 f79a5490 00200082 00000002 f2889bf4 f2889be8 00000000 f706e000 c045f0c0 c0461d00 c0461d00 c0461d00 f2889bf8 f79a55e4 c373ed00 f7c60a80 f706e000 ffffef2f 00000000 00200282 00200282 ffffffff f706e000 00000000 00000000 Call Trace: [] schedule_timeout+0x44/0xa1 [] process_timeout+0x0/0x5 [] do_sys_poll+0x238/0x311 [] __pollwait+0x0/0xba [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] default_wake_function+0x0/0x8 [] __qdisc_run+0x7f/0x1c5 [] _spin_lock+0x31/0x3c [] __qdisc_run+0x7f/0x1c5 [] _spin_lock_irqsave+0x41/0x49 [] _spin_lock_irqsave+0x41/0x49 [] scsi_next_command+0x25/0x2f [] scsi_end_request+0x52/0x75 [] scsi_io_completion+0x87/0x3de [] _spin_lock+0x31/0x3c [] blk_done_softirq+0x5b/0x68 [] __do_softirq+0xa2/0xf9 [] _local_bh_enable+0x44/0xa0 [] sys_poll+0x2d/0x71 [] sysenter_do_call+0x12/0x35 ======================= atd S f3de9f2c 0 3866 1 f79ca5b0 00000082 00000002 f3de9f2c f3de9f20 00000000 f6455ddc c045f0c0 c0461d00 c0461d00 c0461d00 f3de9f30 f79ca704 c3734d00 f7832800 f3de9f60 ffff9c98 00000000 0000000a c045ef20 ffffffff 00000350 00000000 00000000 Call Trace: [] do_nanosleep+0x6d/0x96 [] hrtimer_nanosleep+0x50/0xad [] hrtimer_wakeup+0x0/0x18 [] sys_nanosleep+0x58/0x5c [] sysenter_do_call+0x12/0x35 ======================= cron S f29dff2c 0 3877 1 f6a504b0 00000082 00000002 f29dff2c f29dff20 00000000 c04782ec c045f0c0 c0461d00 c0461d00 c0461d00 f29dff30 f6a50604 c3734d00 f7963080 f29dff60 ffffe623 00000000 00000036 c045ef20 ffffffff 00000044 00000000 00000000 Call Trace: [] do_nanosleep+0x6d/0x96 [] hrtimer_nanosleep+0x50/0xad [] hrtimer_wakeup+0x0/0x18 [] sys_nanosleep+0x58/0x5c [] sysenter_do_call+0x12/0x35 ======================= S99phplog R running 0 3887 1 f79cae20 00000082 00000002 f29eff2c f29eff20 00000000 f64189dc c045f0c0 c0461d00 c0461d00 c0461d00 f29eff30 f79caf74 c373ed00 f78ff800 f29eff60 ffffe5ac 00000000 00000036 c045ef20 ffffffff 0000003d 00000000 00000000 Call Trace: [] do_nanosleep+0x6d/0x96 [] hrtimer_nanosleep+0x50/0xad [] hrtimer_wakeup+0x0/0x18 [] sys_nanosleep+0x58/0x5c [] sysenter_do_call+0x12/0x35 ======================= lighttpd R running 0 3937 1 f71d82b0 00200082 00000002 f2825bf4 f2825be8 00000000 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 f2825bf8 f71d8404 c3734d00 f78a5580 c04b9e00 ffffef43 00000000 00200282 00200282 ffffffff c04b9e00 00000000 00000000 Call Trace: [] schedule_timeout+0x44/0xa1 [] process_timeout+0x0/0x5 [] do_sys_poll+0x238/0x311 [] __pollwait+0x0/0xba [] default_wake_function+0x0/0x8 [] e1000_xmit_frame+0x7a5/0xb7b [e1000e] [] local_bh_enable_ip+0x79/0xb5 [] ipt_do_table+0x1fe/0x496 [ip_tables] [] __qdisc_run+0x7f/0x1c5 [] _spin_lock+0x31/0x3c [] __qdisc_run+0x7f/0x1c5 [] dev_queue_xmit+0xa2/0x509 [] dev_queue_xmit+0xee/0x509 [] local_bh_enable+0x7f/0xc8 [] dev_queue_xmit+0xee/0x509 [] dev_queue_xmit+0x36/0x509 [] ip_finish_output+0x112/0x26b [] ip_local_out+0x15/0x17 [] ip_queue_xmit+0x195/0x327 [] do_IRQ+0x40/0x72 [] __generic_file_aio_write_nolock+0x285/0x4df [] trace_hardirqs_on_thunk+0xc/0x10 [] restore_nocheck_notrace+0x0/0xe [] tcp_v4_send_check+0x3b/0xc8 [] tcp_transmit_skb+0x363/0x659 [] generic_file_aio_write+0x73/0xd8 [] dput+0x97/0x145 [] mntput_no_expire+0x18/0xef [] filp_close+0x3e/0x62 [] sys_poll+0x2d/0x71 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S 00000002 0 3944 
1 f7ac6530 00000082 f7ac6530 00000002 00000000 c040c990 f7ac6530 c045f0c0 c0461d00 c0461d00 c0461d00 00000246 f7ac6684 c3734d00 f78ff300 00000000 00000002 00000002 00000000 00000000 c01228e1 00000246 00000246 c040c980 Call Trace: [] do_wait+0x96/0x342 [] do_wait+0x28c/0x342 [] default_wake_function+0x0/0x8 [] sys_wait4+0x69/0xa4 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S 00000002 0 3956 1 f7a821b0 00000082 f7a821b0 00000002 00000000 c040c990 f7a821b0 c045f0c0 c0461d00 c0461d00 c0461d00 00000246 f7a82304 c3734d00 f78ff080 00000000 00000002 00000002 00000000 00000000 c01228e1 00000246 00000246 c040c980 Call Trace: [] do_wait+0x96/0x342 [] do_wait+0x28c/0x342 [] default_wake_function+0x0/0x8 [] sys_wait4+0x69/0xa4 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S f2847f24 0 3968 1 f79c3610 00000082 00000002 f2847f24 f2847f18 00000000 f79c3610 c045f0c0 c0461d00 c0461d00 c0461d00 f2847f28 f79c3764 c3734d00 f7967300 00000000 ffff9d3f 00000000 00000000 00000000 ffffffff 00000246 00000000 00000000 Call Trace: [] do_wait+0x28c/0x342 [] default_wake_function+0x0/0x8 [] sys_wait4+0x69/0xa4 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S f41a9f24 0 3980 1 f6a50d20 00000082 00000002 f41a9f24 f41a9f18 00000000 f6a50d20 c045f0c0 c0461d00 c0461d00 c0461d00 f41a9f28 f6a50e74 c373ed00 f6b40800 00000000 ffff9d57 00000000 00000000 00000000 ffffffff 00000246 00000000 00000000 Call Trace: [] do_wait+0x28c/0x342 [] default_wake_function+0x0/0x8 [] sys_wait4+0x69/0xa4 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S f4117f24 0 3992 1 f7a0aba0 00000082 00000002 f4117f24 f4117f18 00000000 f7a0aba0 c045f0c0 c0461d00 c0461d00 c0461d00 f4117f28 f7a0acf4 c3734d00 f6b40300 00000000 ffff9d5f 00000000 00000000 00000000 ffffffff 00000246 00000000 00000000 Call Trace: [] do_wait+0x28c/0x342 [] default_wake_function+0x0/0x8 [] sys_wait4+0x69/0xa4 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S f402bf24 0 4005 1 f70822b0 00000082 00000002 f402bf24 f402bf18 00000000 f70822b0 c045f0c0 c0461d00 c0461d00 c0461d00 f402bf28 f7082404 c3734d00 f7801a80 00000000 ffff9d3f 00000000 00000000 00000000 ffffffff 00000246 00000000 00000000 Call Trace: [] do_wait+0x28c/0x342 [] default_wake_function+0x0/0x8 [] sys_wait4+0x69/0xa4 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S 00000002 0 4017 1 f6aa81b0 00000082 f6aa81b0 00000002 00000000 c040c990 f6aa81b0 c045f0c0 c0461d00 c0461d00 c0461d00 00000246 f6aa8304 c3734d00 f7982d00 00000000 00000002 00000002 00000000 00000000 c01228e1 00000246 00000246 c040c980 Call Trace: [] do_wait+0x96/0x342 [] do_wait+0x28c/0x342 [] default_wake_function+0x0/0x8 [] sys_wait4+0x69/0xa4 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S c373baa0 0 4019 3956 f7a2e530 00000082 00000400 c373baa0 00000002 f29cbe20 0000000c c045f0c0 c0461d00 c0461d00 c0461d00 f29cbe28 f7a2e684 c373ed00 f7982a80 f759e45c ffff9d55 00000000 00000001 f759e44c ffffffff 00000000 00000000 00000000 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] do_page_fault+0xa5/0x848 [] do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S 001200d2 0 4020 3956 f7998ba0 00000082 f7998ba0 001200d2 00000246 c03ed880 0000000c c045f0c0 c0461d00 c0461d00 c0461d00 
f7998ba0 f7998cf4 c373ed00 f7982580 f759e45c 00000046 f4ca9d2c 00000001 f759e44c 00000046 00000000 00000002 00000001 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] up_read+0x14/0x2a [] do_page_fault+0xa5/0x848 [] do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S 0000001f 0 4021 3956 f79e8130 00000082 c0405c80 0000001f 00000246 c03ed880 0000000c c045f0c0 c0461d00 c0461d00 c0461d00 f79e8130 f79e8284 c373ed00 f6505080 f759e45c 00000046 f4ca9d2c 00000001 f759e44c 00000046 00000000 00000002 00000001 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] up_read+0x14/0x2a [] do_page_fault+0xa5/0x848 [] do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S 001200d2 0 4022 3956 f7a1aca0 00000082 00000000 001200d2 00000246 c03ed880 0000000c c045f0c0 c0461d00 c0461d00 c0461d00 f7a1aca0 f7a1adf4 c373ed00 f6505a80 f759e45c 00000046 f4ca9d2c 00000001 f759e44c 00000046 00000000 00000002 00000001 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] up_read+0x14/0x2a [] do_page_fault+0xa5/0x848 [] do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S 001200d2 0 4023 3944 f7a2f610 00000082 00000000 001200d2 00000246 c03ed880 0000000c c045f0c0 c0461d00 c0461d00 c0461d00 f7a2f610 f7a2f764 c373ed00 f6505580 f782b75c 00000046 f36f932c 00000001 f782b74c 00000046 00000000 00000002 00000001 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] do_page_fault+0xa5/0x848 [] do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S f2927e24 0 4024 3944 f6a78920 00000082 00000002 f2927e24 f2927e18 00000000 00000000 c045f0c0 c0461d00 c0461d00 c0461d00 f2927e28 f6a78a74 c3734d00 f6505800 f782b75c ffffb5a6 00000000 00000001 f782b74c ffffffff 00000000 00000000 00000000 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] skb_dequeue+0x50/0x56 [] dput+0x97/0x145 [] sys_socketcall+0x23e/0x2a9 [] filp_close+0x3e/0x62 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S c373baa0 0 4025 3944 f79da8a0 00000082 00000400 c373baa0 00000002 f2907e20 000800d0 c045f0c0 c0461d00 c0461d00 c0461d00 f2907e28 f79da9f4 c373ed00 f781a080 f782b75c ffff9d56 00000000 00000001 f782b74c ffffffff 00000000 00000000 00000000 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] 
unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] up_read+0x14/0x2a [] do_page_fault+0xa5/0x848 [] do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S c373baa0 0 4026 3944 f7a16c20 00000082 00000400 c373baa0 00000002 f28ade20 0000000c c045f0c0 c0461d00 c0461d00 c0461d00 f28ade28 f7a16d74 c373ed00 f781a800 f782b75c ffff9d56 00000000 00000001 f782b74c ffffffff 00000000 00000000 00000000 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] up_read+0x14/0x2a [] do_page_fault+0xa5/0x848 [] do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S c01512af 0 4027 3968 f7ab2ca0 00000082 c3423b1c c01512af 00000246 c03ed880 0000000c c045f0c0 c0461d00 c0461d00 c0461d00 f7ab2ca0 f7ab2df4 c3734d00 f781aa80 f782ba5c 00000046 f36f90ac 00000001 f782ba4c 00000046 00000000 00000002 00000001 Call Trace: [] rmqueue_bulk+0x67/0x71 [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] up_read+0x14/0x2a [] do_page_fault+0xa5/0x848 [] do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S 001200d2 0 4028 4005 f7a962b0 00000082 00000000 001200d2 00000246 c03ed880 0000000c c045f0c0 c0461d00 c0461d00 c0461d00 f7a962b0 f7a96404 c373ed00 f6b40a80 f759ea5c 00000046 f4ca90ac 00000001 f759ea4c 00000046 00000000 00000002 00000001 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] up_read+0x14/0x2a [] do_page_fault+0xa5/0x848 [] do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S ffffffff 0 4029 4017 f79b8d20 00000082 00000001 ffffffff 00000000 00000000 00000000 c045f0c0 c0461d00 c0461d00 c0461d00 f79b8d20 f79b8e74 c3734d00 f781ad00 f7855a5c 00000046 f241e82c 00000001 f7855a4c 00000046 00000000 00000002 00000001 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] skb_dequeue+0x50/0x56 [] dput+0x97/0x145 [] sys_socketcall+0x23e/0x2a9 [] filp_close+0x3e/0x62 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S f7054bb4 0 4030 3968 f7a6a920 00000082 00000000 f7054bb4 f7a6a920 f7a6a920 000800d0 c045f0c0 c0461d00 c0461d00 c0461d00 f7a6a920 f7a6aa74 c3734d00 f7896a80 f782ba5c 00000046 f36f90ac 00000001 f782ba4c 00000046 00000000 00000002 00000001 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] up_read+0x14/0x2a [] do_page_fault+0xa5/0x848 [] do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] 
trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S f2dcbe24 0 4031 4017 f71d9390 00000082 00000002 f2dcbe24 f2dcbe18 00000000 00000000 c045f0c0 c0461d00 c0461d00 c0461d00 f2dcbe28 f71d94e4 c3734d00 f7896080 f7855a5c ffffa048 00000000 00000001 f7855a4c ffffffff 00000000 00000000 00000000 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] skb_dequeue+0x50/0x56 [] dput+0x97/0x145 [] sys_socketcall+0x23e/0x2a9 [] filp_close+0x3e/0x62 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S c01512af 0 4032 3968 f7a163b0 00000082 c3424e00 c01512af 00000246 c03ed880 0000000c c045f0c0 c0461d00 c0461d00 c0461d00 f7a163b0 f7a16504 c3734d00 f7c60d00 f782ba5c 00000046 f36f90ac 00000001 f782ba4c 00000046 00000000 00000002 00000001 Call Trace: [] rmqueue_bulk+0x67/0x71 [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] up_read+0x14/0x2a [] do_page_fault+0xa5/0x848 [] do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S f4177e24 0 4033 4017 f7a6a0b0 00000082 00000002 f4177e24 f4177e18 00000000 00000000 c045f0c0 c0461d00 c0461d00 c0461d00 f4177e28 f7a6a204 c3734d00 f756ca80 f7855a5c ffffc6d5 00000000 00000001 f7855a4c ffffffff 00000000 00000000 00000000 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] skb_dequeue+0x50/0x56 [] dput+0x97/0x145 [] sys_socketcall+0x23e/0x2a9 [] filp_close+0x3e/0x62 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S c01512af 0 4034 3968 f79e5190 00000082 c342544c c01512af 00000246 c03ed880 0000000c c045f0c0 c0461d00 c0461d00 c0461d00 f79e5190 f79e52e4 c3734d00 f756c080 f782ba5c 00000046 f36f90ac 00000001 f782ba4c 00000046 00000000 00000002 00000001 Call Trace: [] rmqueue_bulk+0x67/0x71 [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] up_read+0x14/0x2a [] do_page_fault+0xa5/0x848 [] do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S f2da3e24 0 4035 4017 f79e4920 00000082 00000002 f2da3e24 f2da3e18 00000000 00000000 c045f0c0 c0461d00 c0461d00 c0461d00 f2da3e28 f79e4a74 c3734d00 f756cd00 f7855a5c ffffe979 00000000 00000001 f7855a4c ffffffff 00000000 00000000 00000000 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] skb_dequeue+0x50/0x56 [] dput+0x97/0x145 [] sys_socketcall+0x23e/0x2a9 [] filp_close+0x3e/0x62 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S c34260ec 0 4036 4005 f7a97390 00000082 00000000 c34260ec 00000246 c03ed880 0000000c c045f0c0 c0461d00 c0461d00 c0461d00 f7a97390 f7a974e4 c373ed00 f7982080 f759ea5c 00000046 
f4ca90ac 00000001 f759ea4c 00000046 00000000 00000002 00000001 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] up_read+0x14/0x2a [] do_page_fault+0xa5/0x848 [] do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S c3731aa0 0 4037 3980 f7ab2430 00000082 00000400 c3731aa0 00000002 f4007e20 000800d0 c045f0c0 c0461d00 c0461d00 c0461d00 f4007e28 f7ab2584 c3734d00 f6b40580 f783ed5c ffff9d3f 00000001 00000001 f783ed4c ffffffff 00000000 00000000 00000000 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] up_read+0x14/0x2a [] do_page_fault+0xa5/0x848 [] do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S 001200d2 0 4038 3980 f79fc230 00000082 00000000 001200d2 00000246 c03ed880 0000000c c045f0c0 c0461d00 c0461d00 c0461d00 f79fc230 f79fc384 c3734d00 f7982300 f783ed5c 00000046 f240dd2c 00000001 f783ed4c 00000046 00000000 00000002 00000001 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] up_read+0x14/0x2a [] do_page_fault+0xa5/0x848 [] do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S c01512af 0 4039 3980 f7a0b410 00000082 c34273c8 c01512af 00000246 c03ed880 0000000c c045f0c0 c0461d00 c0461d00 c0461d00 f7a0b410 f7a0b564 c373ed00 f64a8800 f783ed5c 00000046 f240dd2c 00000001 f783ed4c 00000046 00000000 00000002 00000001 Call Trace: [] rmqueue_bulk+0x67/0x71 [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] up_read+0x14/0x2a [] do_page_fault+0xa5/0x848 [] do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S c01512af 0 4040 4005 f78e4e20 00000082 c3427a14 c01512af 00000246 c03ed880 0000000c c045f0c0 c0461d00 c0461d00 c0461d00 f78e4e20 f78e4f74 c3734d00 f6a8d580 f759ea5c 00000046 f4ca90ac 00000001 f759ea4c 00000046 00000000 00000002 00000001 Call Trace: [] rmqueue_bulk+0x67/0x71 [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] up_read+0x14/0x2a [] do_page_fault+0xa5/0x848 [] do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S 001200d2 0 4041 3980 f71c9290 00000082 00000000 001200d2 00000246 c03ed880 0000000c c045f0c0 c0461d00 c0461d00 c0461d00 f71c9290 f71c93e4 c373ed00 f7967a80 f783ed5c 00000046 f240dd2c 00000001 f783ed4c 00000046 00000000 00000002 00000001 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] 
autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] up_read+0x14/0x2a [] do_page_fault+0xa5/0x848 [] do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S 001200d2 0 4042 4005 f79cb690 00000082 00000000 001200d2 00000246 c03ed880 0000000c c045f0c0 c0461d00 c0461d00 c0461d00 f79cb690 f79cb7e4 c3734d00 f6a8da80 f759ea5c 00000046 f4ca90ac 00000001 f759ea4c 00000046 00000000 00000002 00000001 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] up_read+0x14/0x2a [] do_page_fault+0xa5/0x848 [] do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S f29b9e24 0 4043 3992 f2d7b210 00000082 00000002 f29b9e24 f29b9e18 00000000 00000000 c045f0c0 c0461d00 c0461d00 c0461d00 f29b9e28 f2d7b364 c3734d00 f6a8d080 f75ded5c ffffe350 00000000 00000001 f75ded4c ffffffff 00000000 00000000 00000000 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] skb_dequeue+0x50/0x56 [] dput+0x97/0x145 [] sys_socketcall+0x23e/0x2a9 [] filp_close+0x3e/0x62 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S 001200d2 0 4044 3992 f2d7a9a0 00000082 00000000 001200d2 00000246 c03ed880 0000000c c045f0c0 c0461d00 c0461d00 c0461d00 f2d7a9a0 f2d7aaf4 c3734d00 f6a8d300 f75ded5c 00000046 f240daac 00000001 f75ded4c 00000046 00000000 00000002 00000001 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] up_read+0x14/0x2a [] do_page_fault+0xa5/0x848 [] do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S f5009e24 0 4045 3992 f2d7a130 00000082 00000002 f5009e24 f5009e18 00000000 0000000c c045f0c0 c0461d00 c0461d00 c0461d00 f5009e28 f2d7a284 c373ed00 f28a9d00 f75ded5c ffff9d5f 00000000 00000001 f75ded4c ffffffff 00000000 00000000 00000000 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] up_read+0x14/0x2a [] do_page_fault+0xa5/0x848 [] do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S c01512af 0 4046 3992 f29e1290 00000082 c34112c0 c01512af 00000246 c03ed880 0000000c c045f0c0 c0461d00 c0461d00 c0461d00 f29e1290 f29e13e4 c3734d00 f28a9a80 f75ded5c 00000046 f240daac 00000001 f75ded4c 00000046 00000000 00000002 00000001 Call Trace: [] rmqueue_bulk+0x67/0x71 [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] up_read+0x14/0x2a [] do_page_fault+0xa5/0x848 [] 
do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S f29e5e24 0 4047 3992 f29e0a20 00000082 00000002 f29e5e24 f29e5e18 00000000 0000000c c045f0c0 c0461d00 c0461d00 c0461d00 f29e5e28 f29e0b74 c373ed00 f28a9800 f75ded5c ffff9d60 00000000 00000001 f75ded4c ffffffff 00000000 00000000 00000000 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] up_read+0x14/0x2a [] do_page_fault+0xa5/0x848 [] do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S f7054bb4 0 4048 3992 f29e01b0 00000082 00000000 f7054bb4 f29e01b0 f29e01b0 000800d0 c045f0c0 c0461d00 c0461d00 c0461d00 f29e01b0 f29e0304 c3734d00 f28a9580 f75ded5c 00000046 f240daac 00000001 f75ded4c 00000046 00000000 00000002 00000001 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] up_read+0x14/0x2a [] do_page_fault+0xa5/0x848 [] do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S f29c5e24 0 4049 3992 f29c3310 00000082 00000002 f29c5e24 f29c5e18 00000000 0000000c c045f0c0 c0461d00 c0461d00 c0461d00 f29c5e28 f29c3464 c373ed00 f28a9300 f75ded5c ffff9d60 00000000 00000001 f75ded4c ffffffff 00000000 00000000 00000000 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] up_read+0x14/0x2a [] do_page_fault+0xa5/0x848 [] do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= php5-cgi S 001200d2 0 4050 3992 f29c2aa0 00000082 00000000 001200d2 00000246 c03ed880 0000000c c045f0c0 c0461d00 c0461d00 c0461d00 f29c2aa0 f29c2bf4 c3734d00 f28a9080 f75ded5c 00000046 f240daac 00000001 f75ded4c 00000046 00000000 00000002 00000001 Call Trace: [] schedule_timeout+0x69/0xa1 [] __skb_recv_datagram+0x65/0x22d [] autoremove_wake_function+0x0/0x35 [] skb_recv_datagram+0x22/0x27 [] unix_accept+0x56/0xe3 [] do_accept+0xf0/0x1d4 [] handle_mm_fault+0x176/0x8a4 [] up_read+0x14/0x2a [] do_page_fault+0xa5/0x848 [] do_page_fault+0x156/0x848 [] sys_socketcall+0x23e/0x2a9 [] trace_hardirqs_on_thunk+0xc/0x10 [] sysenter_do_call+0x12/0x35 ======================= bash S 00000000 0 4055 2969 f6a68630 00000082 00000002 00000000 f7191c3c f6a68630 c03eb0b0 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f6a68784 c3734d00 f7801580 00000000 00000002 00000001 00000046 f719197c f7191800 f7192000 00000046 00000000 Call Trace: [] schedule_timeout+0x69/0xa1 [] _spin_lock_irqsave+0x41/0x49 [] read_chan+0x1e1/0x640 [] default_wake_function+0x0/0x8 [] tty_read+0x70/0x9d [] read_chan+0x0/0x640 [] tty_read+0x0/0x9d [] vfs_read+0x85/0x11b [] sys_read+0x41/0x6a [] sysenter_do_call+0x12/0x35 ======================= bash S f1857eb4 0 4078 2965 f2836530 00000082 00000002 f1857eb4 f1857ea8 00000000 c03eb0b0 c045f0c0 c0461d00 c0461d00 c0461d00 f1857eb8 f2836684 c3734d00 f7832300 00000000 ffffe335 
00000000 00000046 f748817c ffffffff f651a000 00000000 00000000 Call Trace: [] schedule_timeout+0x69/0xa1 [] _spin_lock_irqsave+0x41/0x49 [] read_chan+0x1e1/0x640 [] default_wake_function+0x0/0x8 [] tty_read+0x70/0x9d [] read_chan+0x0/0x640 [] tty_read+0x0/0x9d [] vfs_read+0x85/0x11b [] sys_read+0x41/0x6a [] sysenter_do_call+0x12/0x35 ======================= mailmanctl S f2017f24 0 4100 1 f2837610 00200082 00000002 f2017f24 f2017f18 00000000 f2837610 c045f0c0 c0461d00 c0461d00 c0461d00 f2017f28 f2837764 c373ed00 f7801080 00000000 ffffedab 00000000 00000000 00000000 ffffffff 00200246 00000000 00000000 Call Trace: [] do_wait+0x28c/0x342 [] default_wake_function+0x0/0x8 [] sys_wait4+0x69/0xa4 [] sysenter_do_call+0x12/0x35 ======================= python R running 0 4101 4100 f29c2230 00000086 00000002 f18bbb14 f18bbb08 00000000 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 f18bbb18 f29c2384 c3734d00 f7967080 c04b9e00 ffffef43 00000000 00000286 00000286 ffffffff c04b9e00 00000000 00000000 Call Trace: [] schedule_timeout+0x44/0xa1 [] process_timeout+0x0/0x5 [] do_select+0x3ef/0x4cb [] do_select+0x14/0x4cb [] __pollwait+0x0/0xba [] cpupri_set+0x96/0xbc [] __enqueue_rt_entity+0x87/0x151 [] try_to_wake_up+0xa8/0x137 [] __wake_up_common+0x46/0x68 [] _read_unlock+0x14/0x1c [] sock_queue_rcv_skb+0x8d/0xc6 [] _spin_unlock+0x14/0x1c [] udp_queue_rcv_skb+0x11b/0x223 [] __udp4_lib_lookup+0xed/0x110 [] __udp4_lib_rcv+0x324/0x7a6 [] ipt_local_in_hook+0x0/0x19 [iptable_filter] [] ip_local_deliver_finish+0x10d/0x15f [] ip_local_deliver_finish+0x3f/0x15f [] nommu_map_single+0x0/0x63 [] core_sys_select+0x1f3/0x335 [] core_sys_select+0x22/0x335 [] e1000_clean_rx_irq+0x301/0x41a [e1000] [] e1000_clean_rx_irq+0x0/0x41a [e1000] [] e1000_clean+0x450/0x5f9 [e1000] [] gfs_readdir+0x329/0x3c2 [gfs] [] filldir_reg_func+0x0/0x1c9 [gfs] [] copy_to_user+0x2f/0x4d [] filldir64+0x0/0xc5 [] net_rx_action+0x85/0x1db [] net_rx_action+0x9d/0x1db [] sys_select+0xd6/0x19e [] _local_bh_enable+0x44/0xa0 [] sysenter_do_call+0x12/0x35 ======================= python R running 0 4102 4100 f203b310 00000086 00000046 00000000 00000002 00000001 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f203b464 c3734d00 f78a5d00 c04b9e00 00000286 c04b9e00 00000286 00000286 f20a9b38 c04b9e00 c0128a5a 00000000 Call Trace: [] __mod_timer+0x94/0xa3 [] schedule_timeout+0x44/0xa1 [] process_timeout+0x0/0x5 [] do_select+0x3ef/0x4cb [] do_select+0x14/0x4cb [] __pollwait+0x0/0xba [] wait_for_common+0x26/0x11b [] find_get_page+0xa2/0xbf [] find_get_page+0xa2/0xbf [] find_get_page+0xd/0xbf [] do_filldir_main+0x28/0x1d1 [gfs] [] compare_dents+0x0/0x65 [gfs] [] wait_for_common+0x26/0x11b [] core_sys_select+0x1f3/0x335 [] core_sys_select+0x22/0x335 [] gfs_glock_dq+0xf0/0x157 [gfs] [] _spin_lock+0x31/0x3c [] gfs_glock_dq+0x9a/0x157 [gfs] [] filldir64+0x0/0xc5 [] _spin_unlock+0x14/0x1c [] gfs_readdir+0x329/0x3c2 [gfs] [] filldir_reg_func+0x0/0x1c9 [gfs] [] copy_to_user+0x2f/0x4d [] filldir64+0x0/0xc5 [] file_kill+0x13/0x2d [] sys_select+0xd6/0x19e [] copy_to_user+0x2f/0x4d [] sysenter_do_call+0x12/0x35 ======================= python R running 0 4103 4100 f203aaa0 00000086 00000000 c3726014 f203aaa0 f18c5ea0 00000246 c045f0c0 c0461d00 c0461d00 c0461d00 00000002 f203abf4 c3734d00 f7801800 c02a6018 00000246 00000246 f75a0c90 f75a0c90 f18c5e9c c3726004 c0319826 f64f4680 Call Trace: [] dm_get_table+0x11/0x2e [] _read_unlock+0x14/0x1c [] dm_table_unplug_all+0x26/0x35 [] io_schedule+0x1b/0x24 [] sync_page+0x2c/0x39 [] __wait_on_bit+0x42/0x5e [] sync_page+0x0/0x39 [] 
wait_on_page_bit+0x64/0x6b [] wake_bit_function+0x0/0x3c [] wait_on_page_writeback_range+0x9c/0x104 [] gfs_sync_page_i+0x28/0x5c [gfs] [] gfs_drop_inode+0x55/0x5d [gfs] [] iput+0x44/0x4a [] do_unlinkat+0xc6/0x140 [] do_munmap+0x183/0x1e0 [] sysenter_do_call+0x12/0x35 ======================= python R running 0 4104 4100 00000283 00000283 000000e0 f71a0000 f88ed512 0000002a 00000001 c04b9e00 00000286 f18c9b38 f18c9b18 c0319cc1 f71a0618 0000002a f7874f00 00000046 c04b9e00 00000286 c04b9e00 f71a0000 f7097080 00009b38 00df0000 f71a0580 Call Trace: [] e1000_xmit_frame+0x7a5/0xb7b [e1000e] [] _spin_lock_irqsave+0x41/0x49 [] read_tsc+0x6/0x28 [] sched_slice+0x31/0x37 [] read_tsc+0x6/0x28 [] run_timer_softirq+0x30/0x192 [] _spin_unlock_irq+0x20/0x23 [] run_timer_softirq+0x16a/0x192 [] __do_softirq+0xa2/0xf9 [] _local_bh_enable+0x44/0xa0 [] trace_hardirqs_on_thunk+0xc/0x10 [] restore_nocheck_notrace+0x0/0xe [] native_read_tsc+0x8/0xf [] delay_tsc+0x12/0x47 [] __delay+0x6/0x7 [] _raw_spin_lock+0xe6/0x170 [] igrab+0xd/0x3b [] igrab+0xd/0x3b [] gfs_iget+0x38/0x1e7 [gfs] [] gfs_inode_attr_in+0xb/0x26 [gfs] [] inode_go_unlock+0x20/0x2a [gfs] [] gfs_glock_dq+0xc5/0x157 [gfs] [] gfs_glock_dq_uninit+0x8/0x10 [gfs] [] gfs_permission+0x59/0x63 [gfs] [] inode_permission+0x57/0x95 [] __link_path_walk+0x69/0xbf0 [] core_sys_select+0x22/0x335 [] path_walk+0x37/0x73 [] do_path_lookup+0x63/0x10d [] getname+0x89/0x98 [] user_path_at+0x37/0x68 [] copy_to_user+0x2f/0x4d [] sys_linkat+0x3d/0xf5 [] copy_to_user+0x2f/0x4d [] sys_select+0x183/0x19e [] sys_link+0x2f/0x33 [] sysenter_do_call+0x12/0x35 ======================= python R running 0 4105 4100 f18cb390 00000086 00000002 f18cdb14 f18cdb08 00000000 f706e000 c045f0c0 c0461d00 c0461d00 c0461d00 f18cdb18 f18cb4e4 c373ed00 f7c60800 f706e000 ffffef7b 00000000 00000286 00000286 ffffffff f706e000 00000000 00000000 Call Trace: [] schedule_timeout+0x44/0xa1 [] process_timeout+0x0/0x5 [] do_select+0x3ef/0x4cb [] do_select+0x14/0x4cb [] __pollwait+0x0/0xba [] wait_for_common+0x26/0x11b [] wait_for_common+0x26/0x11b [] _spin_unlock_irq+0x20/0x23 [] wait_for_common+0xcd/0x11b [] find_get_page+0xa2/0xbf [] gfs_sort+0x80/0x103 [gfs] [] do_filldir_main+0x28/0x1d1 [gfs] [] compare_dents+0x0/0x65 [gfs] [] getbuf+0xfe/0x196 [gfs] [] wait_for_common+0x26/0x11b [] core_sys_select+0x1f3/0x335 [] core_sys_select+0x22/0x335 [] gfs_glock_dq+0xf0/0x157 [gfs] [] _spin_lock+0x31/0x3c [] gfs_glock_dq+0x9a/0x157 [gfs] [] filldir64+0x0/0xc5 [] _spin_unlock+0x14/0x1c [] gfs_readdir+0x329/0x3c2 [gfs] [] filldir_reg_func+0x0/0x1c9 [gfs] [] copy_to_user+0x2f/0x4d [] filldir64+0x0/0xc5 [] dput+0xa9/0x145 [] sys_select+0xd6/0x19e [] filp_close+0x3e/0x62 [] sysenter_do_call+0x12/0x35 ======================= python R running 0 4106 4100 f18cab20 00000086 00000046 00000000 00000002 00000001 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f18cac74 c3734d00 f7c60580 c04b9e00 00000286 c04b9e00 00000286 00000286 f18cfb38 c04b9e00 c0128a5a 00000000 Call Trace: [] __mod_timer+0x94/0xa3 [] schedule_timeout+0x44/0xa1 [] process_timeout+0x0/0x5 [] do_select+0x3ef/0x4cb [] do_select+0x14/0x4cb [] e1000_clean+0x450/0x5f9 [e1000] [] cpupri_set+0x96/0xbc [] __pollwait+0x0/0xba [] net_rx_action+0x85/0x1db [] net_rx_action+0x9d/0x1db [] __do_softirq+0xa2/0xf9 [] _local_bh_enable+0x44/0xa0 [] find_get_page+0xa2/0xbf [] find_get_page+0xa2/0xbf [] find_get_page+0xd/0xbf [] do_filldir_main+0x28/0x1d1 [gfs] [] compare_dents+0x0/0x65 [gfs] [] wait_for_common+0x26/0x11b [] core_sys_select+0x1f3/0x335 [] 
core_sys_select+0x22/0x335 [] gfs_glock_dq+0xf0/0x157 [gfs] [] _spin_lock+0x31/0x3c [] gfs_glock_dq+0x9a/0x157 [gfs] [] filldir64+0x0/0xc5 [] _spin_unlock+0x14/0x1c [] gfs_readdir+0x329/0x3c2 [gfs] [] filldir_reg_func+0x0/0x1c9 [gfs] [] copy_to_user+0x2f/0x4d [] filldir64+0x0/0xc5 [] file_kill+0x13/0x2d [] sys_select+0xd6/0x19e [] copy_to_user+0x2f/0x4d [] sysenter_do_call+0x12/0x35 ======================= python R running 0 4107 4100 f18ca2b0 00000086 00000046 00000000 00000002 00000001 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f18ca404 c3734d00 f7c60300 c04b9e00 00000286 c04b9e00 00000286 00000286 f5d0fb38 c04b9e00 c0128a5a 00000000 Call Trace: [] __mod_timer+0x94/0xa3 [] schedule_timeout+0x44/0xa1 [] process_timeout+0x0/0x5 [] do_select+0x3ef/0x4cb [] do_select+0x14/0x4cb [] _spin_lock_irqsave+0x41/0x49 [] __pollwait+0x0/0xba [] wait_for_common+0x26/0x11b [] wait_for_common+0x26/0x11b [] _spin_unlock_irq+0x20/0x23 [] gfs_glock_dq+0xf0/0x157 [gfs] [] _spin_lock+0x31/0x3c [] gfs_glock_dq+0x9a/0x157 [gfs] [] dput+0x97/0x145 [] gfs_drevalidate+0x1a6/0x1f0 [gfs] [] core_sys_select+0x1f3/0x335 [] core_sys_select+0x22/0x335 [] gfs_glock_dq+0xf0/0x157 [gfs] [] _spin_lock+0x31/0x3c [] gfs_glock_dq+0x9a/0x157 [gfs] [] copy_to_user+0x2f/0x4d [] cp_new_stat64+0xf9/0x10b [] sys_select+0xd6/0x19e [] sysenter_do_call+0x12/0x35 ======================= python S f2015b14 0 4108 4100 f283b410 00000086 00000002 f2015b14 f2015b08 00000000 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 f2015b18 f283b564 c3734d00 f64a8080 c04b9e00 ffffeddf 00000000 00000286 00000286 ffffffff c04b9e00 00000000 00000000 Call Trace: [] schedule_timeout+0x44/0xa1 [] process_timeout+0x0/0x5 [] do_select+0x3ef/0x4cb [] do_select+0x14/0x4cb [] e1000_clean+0x450/0x5f9 [e1000] [] __pollwait+0x0/0xba [] wait_for_common+0x26/0x11b [] wait_for_common+0x26/0x11b [] _spin_unlock_irq+0x20/0x23 [] wait_for_common+0xcd/0x11b [] find_get_page+0xa2/0xbf [] gfs_sort+0x80/0x103 [gfs] [] do_filldir_main+0x28/0x1d1 [gfs] [] compare_dents+0x0/0x65 [gfs] [] getbuf+0xfe/0x196 [gfs] [] wait_for_common+0x26/0x11b [] core_sys_select+0x1f3/0x335 [] core_sys_select+0x22/0x335 [] gfs_glock_dq+0xf0/0x157 [gfs] [] _spin_lock+0x31/0x3c [] gfs_glock_dq+0x9a/0x157 [gfs] [] filldir64+0x0/0xc5 [] _spin_unlock+0x14/0x1c [] gfs_readdir+0x329/0x3c2 [gfs] [] filldir_reg_func+0x0/0x1c9 [gfs] [] filldir64+0x0/0xc5 [] dput+0xa9/0x145 [] sys_select+0xd6/0x19e [] filp_close+0x3e/0x62 [] sysenter_do_call+0x12/0x35 ======================= smtpd R running 0 4109 3362 f283aba0 00200086 00200046 00000000 00000002 00000001 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f283acf4 c3734d00 f64a8d00 c04b9e00 00200282 c04b9e00 00200282 00200282 f143bf40 c04b9e00 c0128a5a 00000000 Call Trace: [] __mod_timer+0x94/0xa3 [] schedule_timeout+0x44/0xa1 [] process_timeout+0x0/0x5 [] sys_epoll_wait+0x15b/0x500 [] default_wake_function+0x0/0x8 [] sysenter_do_call+0x12/0x35 ======================= proxymap S 00000002 0 4110 3362 f283a330 00200086 c3731f30 00000002 00000000 f1916ed0 f283a330 c045f0c0 c0461d00 c0461d00 c0461d00 00200046 f283a484 c3734d00 f7896800 f1916e90 f283a330 00000002 00000001 00000000 f1916ea0 00200046 f1916e90 f1916e80 Call Trace: [] schedule_timeout+0x69/0xa1 [] sys_epoll_wait+0x15b/0x500 [] default_wake_function+0x0/0x8 [] sysenter_do_call+0x12/0x35 ======================= trivial-rewri R running 0 4111 3362 f143f490 00200086 00000002 f4851f1c f4851f10 00000000 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 f4851f20 f143f5e4 c3734d00 f6b40d00 c04b9e00 
ffffef29 00000000 00200282 00200282 ffffffff c04b9e00 00000000 00000000 Call Trace: [] schedule_timeout+0x44/0xa1 [] process_timeout+0x0/0x5 [] sys_epoll_wait+0x15b/0x500 [] default_wake_function+0x0/0x8 [] sysenter_do_call+0x12/0x35 ======================= cleanup R running 0 4112 3362 f2836da0 00200086 00200046 00000000 00000002 00000001 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f2836ef4 c3734d00 f64a8580 c04b9e00 00200282 c04b9e00 00200282 00200282 f180df40 c04b9e00 c0128a5a 00000000 Call Trace: [] __mod_timer+0x94/0xa3 [] schedule_timeout+0x44/0xa1 [] process_timeout+0x0/0x5 [] sys_epoll_wait+0x15b/0x500 [] default_wake_function+0x0/0x8 [] sysenter_do_call+0x12/0x35 ======================= smtp S f19d9f50 0 4113 3362 f21781b0 00200086 00000002 f19d9f50 f19d9f44 00000000 00200046 c045f0c0 c0461d00 c0461d00 c0461d00 f19d9f54 f2178304 c373ed00 f2033d00 00200282 ffffef4b 00000000 c0319cc1 00000000 ffffffff 00200046 00000000 00000000 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] flock_lock_file_wait+0x177/0x275 [] autoremove_wake_function+0x0/0x35 [] sys_flock+0x11d/0x122 [] sysenter_do_call+0x12/0x35 ======================= smtp S f154ff50 0 4114 3362 f2179290 00200086 00000002 f154ff50 f154ff44 00000000 00200046 c045f0c0 c0461d00 c0461d00 c0461d00 f154ff54 f21793e4 c373ed00 f6b40080 00200282 ffffef66 00000000 c0319cc1 00000000 ffffffff 00200046 00000000 00000000 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] flock_lock_file_wait+0x177/0x275 [] autoremove_wake_function+0x0/0x35 [] sys_flock+0x11d/0x122 [] sysenter_do_call+0x12/0x35 ======================= smtp S 00000000 0 4115 3362 f2178a20 00200086 00200046 00000000 00000002 00000001 00200046 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f2178b74 c3734d00 f7c60080 00200282 00000001 f1027f44 c0319cc1 00000000 00200046 00200046 f1027f44 00200282 Call Trace: [] _spin_lock_irqsave+0x41/0x49 [] flock_lock_file_wait+0x177/0x275 [] autoremove_wake_function+0x0/0x35 [] sys_flock+0x11d/0x122 [] sysenter_do_call+0x12/0x35 ======================= smtp R running 0 4116 3362 f1dbe67c f1dbf0c0 c0182513 f1dbe674 c0180230 f1dbe674 f1dbe67c c0181511 00000008 f1dbf0c0 c0171dc1 00000040 00000000 f1dbe674 f7020680 f2091980 f7041500 00000000 f2091980 c016f61c 0000000c f7041500 f7041580 0000000c Call Trace: [] iput+0x1d/0x4a [] d_kill+0x2b/0x44 [] dput+0x78/0x145 [] __fput+0x127/0x17f [] filp_close+0x3e/0x62 [] sys_close+0x5b/0x97 [] sysenter_do_call+0x12/0x35 ======================= smtp S 00000000 0 4117 3362 f1504ca0 00200086 00200046 00000000 00000002 00000001 c04b9e00 c045f0c0 c0461d00 c0461d00 c0461d00 00000000 f1504df4 c3734d00 f7963300 c04b9e00 00200282 c04b9e00 00200282 00200282 f154bf40 c04b9e00 c0128a5a 00000000 Call Trace: [] __mod_timer+0x94/0xa3 [] schedule_timeout+0x44/0xa1 [] process_timeout+0x0/0x5 [] sys_epoll_wait+0x15b/0x500 [] default_wake_function+0x0/0x8 [] sysenter_do_call+0x12/0x35 ======================= SysRq : Show backtrace of all active CPUs CPU0: Pid: 4116, comm: smtp Not tainted (2.6.27.21 #6) EIP: 0060:[] EFLAGS: 00200246 CPU: 0 EIP is at native_read_tsc+0xe/0xf EAX: ad617fe3 EBX: 00000000 ECX: 00000000 EDX: 000000f3 ESI: ad617f17 EDI: 00000001 EBP: 00000000 ESP: f1551eec DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 CR0: 8005003b CR2: b7c6a84d CR3: 31011000 CR4: 000006f0 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: ffff0ff0 DR7: 00000400 [] delay_tsc+0x2a/0x47 [] __delay+0x6/0x7 [] _raw_spin_lock+0xe6/0x170 [] _atomic_dec_and_lock+0x31/0x58 [] _atomic_dec_and_lock+0x31/0x58 [] iput+0x1d/0x4a [] 
d_kill+0x2b/0x44 [] dput+0x78/0x145 [] __fput+0x127/0x17f [] filp_close+0x3e/0x62 [] sys_close+0x5b/0x97 [] sysenter_do_call+0x12/0x35 ======================= SysRq : Emergency Sync SysRq : Emergency Sync SysRq : Show Regs Pid: 4116, comm: smtp Not tainted (2.6.27.21 #6) EIP: 0060:[] EFLAGS: 00200246 CPU: 0 EIP is at delay_tsc+0x2a/0x47 EAX: b8030ea1 EBX: 00000000 ECX: 00000000 EDX: 0000010b ESI: b8030dd5 EDI: 00000001 EBP: 00000000 ESP: f1551ef0 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 CR0: 8005003b CR2: b7c6a84d CR3: 31011000 CR4: 000006f0 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: ffff0ff0 DR7: 00000400 [] __delay+0x6/0x7 [] _raw_spin_lock+0xe6/0x170 [] _atomic_dec_and_lock+0x31/0x58 [] _atomic_dec_and_lock+0x31/0x58 [] iput+0x1d/0x4a [] d_kill+0x2b/0x44 [] dput+0x78/0x145 [] __fput+0x127/0x17f [] filp_close+0x3e/0x62 [] sys_close+0x5b/0x97 [] sysenter_do_call+0x12/0x35 ======================= SysRq : Show Memory Mem-Info: DMA per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 CPU 1: hi: 0, btch: 1 usd: 0 Normal per-cpu: CPU 0: hi: 186, btch: 31 usd: 133 CPU 1: hi: 186, btch: 31 usd: 70 HighMem per-cpu: CPU 0: hi: 186, btch: 31 usd: 10 CPU 1: hi: 186, btch: 31 usd: 40 Active:53773 inactive:23336 dirty:1098 writeback:0 unstable:0 free:679348 slab:13056 mapped:11237 pagetables:922 bounce:0 DMA free:8804kB min:1140kB low:1424kB high:1708kB active:0kB inactive:0kB present:15788kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 868 3015 3015 Normal free:784680kB min:64392kB low:80488kB high:96588kB active:3792kB inactive:9156kB present:889680kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 17174 17174 HighMem free:1923908kB min:512kB low:40288kB high:80064kB active:211300kB inactive:84188kB present:2198284kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 0 DMA: 3*4kB 3*8kB 2*16kB 3*32kB 3*64kB 4*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 8804kB Normal: 104*4kB 367*8kB 229*16kB 82*32kB 63*64kB 29*128kB 23*256kB 7*512kB 8*1024kB 4*2048kB 181*4096kB = 784616kB HighMem: 1*4kB 0*8kB 1*16kB 2*32kB 1*64kB 1*128kB 0*256kB 1*512kB 2*1024kB 0*2048kB 469*4096kB = 1923860kB 32045 total pagecache pages 0 pages in swap cache Swap cache stats: add 0, delete 0, find 0/0 Free swap = 979832kB Total swap = 979832kB 786176 pages RAM 556800 pages HighMem 12334 pages reserved 58472 pages shared 71957 pages non-shared SysRq : Show Blocked State task PC stack pid father SysRq : Resetting From kadlec at mail.kfki.hu Sat Mar 28 00:36:15 2009 From: kadlec at mail.kfki.hu (Kadlecsik Jozsef) Date: Sat, 28 Mar 2009 01:36:15 +0100 (CET) Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: References: <1404804625.1710261238184677530.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: On Sat, 28 Mar 2009, Kadlecsik Jozsef wrote: > On Fri, 27 Mar 2009, Bob Peterson wrote: > > > Perhaps you should change your post_fail_delay to some very high > > number, recreate the problem, and when it freezes force a > > sysrq-trigger to get call traces for all the processes. > > Then also you can look at the dmesg to see if there was a kernel > > panic or something on the node that would otherwise be > > immediately fenced. > > I enabled more kernel debugging, netconsole and captured the attaced > console log. I hope it gives the required info. I should get some sleep - but can't it be that I hit the potential deadlock mentioned here: commit 4787e11dc7831f42228b89ba7726fd6f6901a1e3 gfs-kmod: workaround for potential deadlock. 
Prefault user pages The bug uncovered in 461770 does not seem fixable without a massive change to how gfs works. There is a lock ordering mismatch between the process address space lock and the glocks. The only good way to avoid this in all cases is to not hold the glock for so long, which is what gfs2 does. This is impossible without completely changing how gfs does locking. Fortunately, this is only a problem when you have multiple processes sharing an address space, and are doing IO to a gfs file with a userspace buffer that's part of an mmapped gfs file. In this case, prefaulting the buffer's pages immediately before acquiring the glocks significantly shortens the window for this deadlock. Closing the window any more causes a large performance hit. Mailman do mmap files... Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 49, Hungary From s.wendy.cheng at gmail.com Sat Mar 28 03:36:16 2009 From: s.wendy.cheng at gmail.com (Wendy Cheng) Date: Fri, 27 Mar 2009 22:36:16 -0500 Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: References: <1404804625.1710261238184677530.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <1a2a6dd60903272036k7bedef6ft718cf74331f562bc@mail.gmail.com> > > I should get some sleep - but can't it be that I hit the potential > deadlock mentioned here: Please take my observation with a grain of salt (as I don't have Linux source code in front of me to check the exact locking sequence, nor can I afford spending time on this) ... I don't see a strong evidence of deadlock (but it could) from the thread backtraces However, assuming the cluster worked before, you could have overloaded the e1000 driver in this case. There are suspicious page faults but memory is very "ok". So one possibility is that GFS had generated too many sync requests that flooded the e1000. As the result, the cluster heart beat missed its interval. Do you have the same ethernet card for both AOE and cluster traffic ? If yes, seperate them to see how it goes. And of course, if you don't have Ben's mmap patch (as you described in your post), it is probably a good idea to get it into your gfs-kmod. But honestly, I think running GFS1 on newer kernels is a bad idea. -- Wendy > > commit 4787e11dc7831f42228b89ba7726fd6f6901a1e3 > > gfs-kmod: workaround for potential deadlock. Prefault user pages > > The bug uncovered in 461770 does not seem fixable without a massive > change to how gfs works. There is a lock ordering mismatch between > the process address space lock and the glocks. The only good way to > avoid this in all cases is to not hold the glock for so long, which > is what gfs2 does. This is impossible without completely changing > how gfs does locking. Fortunately, this is only a problem when you > have multiple processes sharing an address space, and are doing IO > to a gfs file with a userspace buffer that's part of an mmapped gfs > file. In this case, prefaulting the buffer's pages immediately > before acquiring the glocks significantly shortens the window for > this deadlock. Closing the window any more causes a large > performance hit. > > Mailman do mmap files... 
> > Best regards, > Jozsef > -- > E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu > PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt > Address: KFKI Research Institute for Particle and Nuclear Physics > H-1525 Budapest 114, POB. 49, Hungary > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kadlec at mail.kfki.hu Sat Mar 28 11:12:56 2009 From: kadlec at mail.kfki.hu (Kadlecsik Jozsef) Date: Sat, 28 Mar 2009 12:12:56 +0100 (CET) Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: <1a2a6dd60903272036k7bedef6ft718cf74331f562bc@mail.gmail.com> References: <1404804625.1710261238184677530.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> <1a2a6dd60903272036k7bedef6ft718cf74331f562bc@mail.gmail.com> Message-ID: Hi, On Fri, 27 Mar 2009, Wendy Cheng wrote: > > I should get some sleep - but can't it be that I hit the potential > > deadlock mentioned here: > > > > commit ?4787e11dc7831f42228b89ba7726fd6f6901a1e3 > > > > gfs-kmod: workaround for potential deadlock. Prefault user pages [...] > > file. In this case, prefaulting the buffer's pages immediately > > before acquiring the glocks significantly shortens the window > > for this deadlock. Closing the window any more causes a large > > performance hit. > > > > Mailman do mmap files... > I don't see a strong evidence of deadlock (but it could) from the thread > backtraces However, assuming the cluster worked before, you could have > overloaded the e1000 driver in this case. There are suspicious page faults > but memory is very "ok". So one possibility is that GFS had generated too > many sync requests that flooded the e1000. As the result, the cluster heart > beat missed its interval. It's a possibility. But it assumes also that the node freezes >because< it was fenced off. So far nothing indicates that. > Do you have the same ethernet card for both AOE and cluster traffic ? If > yes, seperate them to see how it goes. Yes, the AOE and cluster traffic shares the same ethernet card. However with the earlier release whatever high load we had, there was never any locking up, freezing problem. > And of course, if you don't have Ben's mmap patch (as you described in > your post), it is probably a good idea to get it into your gfs-kmod. The patch *is* in the cluster-2.03.11. The comment itself says, it shortens the window for the deadlock but does not eliminate that. As a possible workaround I moved mailman from GFS to a local disk, started it and there was no freeze. The cluster ran for almost seven hours, then two nodes died again :-( > But honestly,? I think running GFS1 on newer kernels is a bad idea. I see. So do you believe GFS2 is better/ready for production? Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 
49, Hungary From s.wendy.cheng at gmail.com Sat Mar 28 16:07:18 2009 From: s.wendy.cheng at gmail.com (Wendy Cheng) Date: Sat, 28 Mar 2009 11:07:18 -0500 Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: References: <1404804625.1710261238184677530.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> <1a2a6dd60903272036k7bedef6ft718cf74331f562bc@mail.gmail.com> Message-ID: <49CE4B36.2000103@gmail.com> Kadlecsik Jozsef wrote: >> I don't see a strong evidence of deadlock (but it could) from the thread >> backtraces However, assuming the cluster worked before, you could have >> overloaded the e1000 driver in this case. There are suspicious page faults >> but memory is very "ok". So one possibility is that GFS had generated too >> many sync requests that flooded the e1000. As the result, the cluster heart >> beat missed its interval. >> > > It's a possibility. But it assumes also that the node freezes >because< > it was fenced off. So far nothing indicates that. > Re-read your console log. There are many foot-prints of spin_lock - that's worrisome. Hit a couple of "sysrq-w" next time when you have hangs, other than sysrq-t. This should give traces of the threads that are actively on CPUs at that time. Also check your kernel change log (to see whether GFS has any new patch that touches spin lock that doesn't in previous release). BTW, I do have opinions on other parts of your postings but don't have time to express them now. Maybe I'll say something when I finish my current chores :) ... Need to rush out now. Good luck on your debugging ! -- Wendy From s.wendy.cheng at gmail.com Sun Mar 29 04:05:30 2009 From: s.wendy.cheng at gmail.com (Wendy Cheng) Date: Sat, 28 Mar 2009 23:05:30 -0500 Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: <49CE4B36.2000103@gmail.com> References: <1404804625.1710261238184677530.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> <1a2a6dd60903272036k7bedef6ft718cf74331f562bc@mail.gmail.com> <49CE4B36.2000103@gmail.com> Message-ID: <49CEF38A.1060202@gmail.com> Wendy Cheng wrote: > ..... [snip] ... There are many foot-prints of spin_lock - that's > worrisome. Hit a couple of "sysrq-w" next time when you have hangs, > other than sysrq-t. This should give traces of the threads that are > actively on CPUs at that time. Also check your kernel change log (to > see whether GFS has any new patch that touches spin lock that doesn't > in previous release). I re-read your console log few minutes ago, followed by a quick browse into cluster git tree. Few of python processes (e.g. pid 4104, 4105, etc) are blocked by locks within gfs_readdir(). This somehow relates to a performance patch committed on 11/6/2008. The gfs_getattr() has a piece of new code that touches vfs inode operation while glock is taken. That's an area that needs examination. I don't have linux kernel source handy to see whether that iput() and igrab() can lead to deadlock though. If you have the patch in your kernel and if you can, temporarily remove it (and rebuild the kernel) to see how it goes: commit a71b12b692cac3a4786241927227013bf2f3bf99 Again, take my advice with a grain of salt :) ...I'll stop here. Good luck ! 
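(For reference, all of these dumps can be triggered from a shell as well as
from a console keyboard. A minimal sketch, assuming the magic SysRq
interface is enabled on the node:

   # enable all SysRq functions (0 disables, 1 enables everything)
   echo 1 > /proc/sys/kernel/sysrq

   # 't': backtrace of every task, the kind of listing in the netconsole log above
   echo t > /proc/sysrq-trigger

   # 'w': tasks blocked in uninterruptible sleep
   echo w > /proc/sysrq-trigger

   # 'l': backtrace of whatever is running on each CPU right now,
   # useful when something looks like it is spinning on a lock
   echo l > /proc/sysrq-trigger

The keyboard combination Alt-SysRq-<key> produces the same output on the
console.)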
-- Wendy From kadlec at mail.kfki.hu Sun Mar 29 18:38:41 2009 From: kadlec at mail.kfki.hu (Kadlecsik Jozsef) Date: Sun, 29 Mar 2009 20:38:41 +0200 (CEST) Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: <49CE4B36.2000103@gmail.com> References: <1404804625.1710261238184677530.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> <1a2a6dd60903272036k7bedef6ft718cf74331f562bc@mail.gmail.com> <49CE4B36.2000103@gmail.com> Message-ID: Hi, On Sat, 28 Mar 2009, Wendy Cheng wrote: > Kadlecsik Jozsef wrote: > > > I don't see a strong evidence of deadlock (but it could) from the > > > thread backtraces However, assuming the cluster worked before, you > > > could have overloaded the e1000 driver in this case. There are > > > suspicious page faults but memory is very "ok". So one possibility > > > is that GFS had generated too many sync requests that flooded the > > > e1000. As the result, the cluster heart beat missed its interval. > > > > It's a possibility. But it assumes also that the node freezes >because< it > > was fenced off. So far nothing indicates that. > > Re-read your console log. There are many foot-prints of spin_lock - that's > worrisome. Hit a couple of "sysrq-w" next time when you have hangs, other > than sysrq-t. This should give traces of the threads that are actively on CPUs > at that time. Also check your kernel change log (to see whether GFS has any > new patch that touches spin lock that doesn't in previous release). I went through the git changelogs yesterday but could not spot anything suspicious, however I'm not a filesystem expert at all. The patch titled gfs-kernel: Bug 450209: Create gfs1-specific lock modules + minor fixes to build with 2.6.27 hit me hard as according to the description, it was *not* tested in cluster environmet when it did replace dlm behind gfs. I reached the decision and we downgraded - could not delay anymore: cluster-2.03.11 -> cluster-2.01.00 linux-2.6.27.21 -> linux-2.6.23.17 The e1000 and e1000e drivers are the newest ones. The aoe driver is from aoe6-59 because aoe6-69 does not support 2.6.23.17. We did not downgrade openais and LVM2. Tomorrow we'll move back mailman to GFS. There are three different netconsole log recordings at http://www.kfki.hu/~kadlec/gfs/, that's all I could do. If there'll be some patches I'll try to test it at one of the nodes but it can't be the one which runs the mailman queue manager and so far I could not find any other method to crash the system at will but to run it. That's a debugging problem to solve. > BTW, I do have opinions on other parts of your postings but don't have > time to express them now. Maybe I'll say something when I finish my > current chores :) I'd definitiely like to read your opinion! We'll reorganize one of the AOE blades by backing up the GFS volume and creating a smaller one to make space for a new GFS2 test volume. Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 
49, Hungary From s.wendy.cheng at gmail.com Mon Mar 30 02:53:59 2009 From: s.wendy.cheng at gmail.com (Wendy Cheng) Date: Sun, 29 Mar 2009 21:53:59 -0500 Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: References: <1404804625.1710261238184677530.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> <1a2a6dd60903272036k7bedef6ft718cf74331f562bc@mail.gmail.com> <49CE4B36.2000103@gmail.com> Message-ID: <49D03447.2000901@gmail.com> Kadlecsik Jozsef wrote: > There are three different netconsole log recordings at > http://www.kfki.hu/~kadlec/gfs/ One of the new console logs has a good catch (netconsole0.txt): you *do* have a deadlock as the CPUs are spinning waiting for spin lock. This seems to be more to do with the changes made via bugzilla 466645. I think RHEL version that has the subject patch will have the same issue as well. -- Wendy From kadlec at mail.kfki.hu Mon Mar 30 07:43:42 2009 From: kadlec at mail.kfki.hu (Kadlecsik Jozsef) Date: Mon, 30 Mar 2009 09:43:42 +0200 (CEST) Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: <49D03447.2000901@gmail.com> References: <1404804625.1710261238184677530.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> <1a2a6dd60903272036k7bedef6ft718cf74331f562bc@mail.gmail.com> <49CE4B36.2000103@gmail.com> <49D03447.2000901@gmail.com> Message-ID: Hi, On Sun, 29 Mar 2009, Wendy Cheng wrote: > Kadlecsik Jozsef wrote: > > There are three different netconsole log recordings at > > http://www.kfki.hu/~kadlec/gfs/ > One of the new console logs has a good catch (netconsole0.txt): you *do* have > a deadlock as the CPUs are spinning waiting for spin lock. This seems to be > more to do with the changes made via bugzilla 466645. I think RHEL version > that has the subject patch will have the same issue as well. You mean the part of the patch @@ -1503,6 +1503,15 @@ gfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct error = gfs_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_ANY, &gh); if (!error) { generic_fillattr(inode, stat); + if (S_ISREG(inode->i_mode) && dentry->d_parent + && dentry->d_parent->d_inode) { + p_inode = igrab(dentry->d_parent->d_inode); + if (p_inode) { + pi = get_v2ip(p_inode); + pi->i_dir_stats++; + iput(p_inode); + } + } gfs_glock_dq_uninit(&gh); } might cause a deadlock: if the parent directory inode is already locked, then this part will wait infinitely to get the lock, isn't it? If I open a directory and then stat a file in it, is that enough to trigger the deadlock? [Shouldn't we move this thread to the cluster-devel list?] Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 
49, Hungary From kadlec at mail.kfki.hu Mon Mar 30 08:04:16 2009 From: kadlec at mail.kfki.hu (Kadlecsik Jozsef) Date: Mon, 30 Mar 2009 10:04:16 +0200 (CEST) Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: References: <1404804625.1710261238184677530.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> <1a2a6dd60903272036k7bedef6ft718cf74331f562bc@mail.gmail.com> <49CE4B36.2000103@gmail.com> <49D03447.2000901@gmail.com> Message-ID: On Mon, 30 Mar 2009, Kadlecsik Jozsef wrote: > On Sun, 29 Mar 2009, Wendy Cheng wrote: > > > Kadlecsik Jozsef wrote: > > > There are three different netconsole log recordings at > > > http://www.kfki.hu/~kadlec/gfs/ > > One of the new console logs has a good catch (netconsole0.txt): you *do* have > > a deadlock as the CPUs are spinning waiting for spin lock. This seems to be > > more to do with the changes made via bugzilla 466645. I think RHEL version > > that has the subject patch will have the same issue as well. > > You mean the part of the patch > > @@ -1503,6 +1503,15 @@ gfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct > error = gfs_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_ANY, &gh); > if (!error) { > generic_fillattr(inode, stat); > + if (S_ISREG(inode->i_mode) && dentry->d_parent > + && dentry->d_parent->d_inode) { > + p_inode = igrab(dentry->d_parent->d_inode); > + if (p_inode) { > + pi = get_v2ip(p_inode); > + pi->i_dir_stats++; > + iput(p_inode); > + } > + } > gfs_glock_dq_uninit(&gh); > } > > might cause a deadlock: if the parent directory inode is already locked, > then this part will wait infinitely to get the lock, isn't it? > > If I open a directory and then stat a file in it, is that enough to > trigger the deadlock? No, that's too simple and should have came out much earlier, the patch is from Nov 6 2008. Something like creating files in a directory by one process and statting at the same time by another one, in a loop? Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 49, Hungary From carlopmart at gmail.com Mon Mar 30 08:40:44 2009 From: carlopmart at gmail.com (carlopmart) Date: Mon, 30 Mar 2009 10:40:44 +0200 Subject: [Linux-cluster] When new cman patch version will be released for RHEL 5.3?? Message-ID: <49D0858C.5090200@gmail.com> Hi all, Sombredy knows when cman will be patched to works ok when it is used with two nodes and quorum partitions?? Thanks ... -- CL Martinez carlopmart {at} gmail {d0t} com From denisb+gmane at gmail.com Mon Mar 30 08:47:25 2009 From: denisb+gmane at gmail.com (denis) Date: Mon, 30 Mar 2009 10:47:25 +0200 Subject: [Linux-cluster] Re: When new cman patch version will be released for RHEL 5.3?? In-Reply-To: <49D0858C.5090200@gmail.com> References: <49D0858C.5090200@gmail.com> Message-ID: carlopmart wrote: > Hi all, > > Sombredy knows when cman will be patched to works ok when it is used > with two nodes and quorum partitions?? > > Thanks ... Any kind of situation report on RHCS in 5.3 would be nice, I am trying as hard as I can to track bugzilla, this list and still I don't think I have a clear feeling as to what works and what is known broken in RHEL5.3 Cluster Suite.. 
Regards -- Denis Braekhus Team Lead Managed Services Redpill Linpro AS - Changing the game From swhiteho at redhat.com Mon Mar 30 09:17:57 2009 From: swhiteho at redhat.com (Steven Whitehouse) Date: Mon, 30 Mar 2009 10:17:57 +0100 Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: References: <096401c9ae82$ac21d650$046582f0$@yarwood@juno.co.uk> <005401c9aeee$56f954c0$04ebfe40$@f2s.com> <49CD201A.309@gmail.com> Message-ID: <1238404677.3352.4.camel@localhost.localdomain> Hi, On Fri, 2009-03-27 at 20:02 +0100, Kadlecsik Jozsef wrote: > On Fri, 27 Mar 2009, Wendy Cheng wrote: > > > ... [snip] ... > > > Sigh. The pressure is mounting to fix the cluster at any cost, and nothing > > > remained but to downgrade to > > > cluster-2.01.00/openais-0.80.3 which would be just ridiculous. > > > > I have doubts that GFS (i.e. GFS1) is tuned and well-maintained on newer > > versions of RHCS (as well as 2.6 based kernels). My impression is that > > GFS1 is supposed to be phased out starting from RHEL 5. So if you are > > running with GFS1, why downgrading RHCS is ridiculous ? > > We'd need features added to recent 2.6 kernels (like read-only bindmount), > so the natural path was upgrading GFS1. However, as in the present state > our cluster is unstable, either we have to find the culprit or go back to > the proven version (and loosing the required new features). Read only bind mounts have not been tested with gfs1 and they might very well not work correctly, so be careful. Our general plan is to try and introduce new features into GFS2 and to maintain GFS with its existing feature set. Thats not to say that there will be no new features in GFS, but just that we are trying in general to put the new stuff in GFS2. Upgrading from GFS1 to GFS2 will always have to involve a shutdown in cluster operations since the differing journalling schemes rule out a node by node in place upgrade I'm afraid, Steve. From kadlec at mail.kfki.hu Mon Mar 30 09:20:25 2009 From: kadlec at mail.kfki.hu (Kadlecsik Jozsef) Date: Mon, 30 Mar 2009 11:20:25 +0200 (CEST) Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: <1238404677.3352.4.camel@localhost.localdomain> References: <096401c9ae82$ac21d650$046582f0$@yarwood@juno.co.uk> <005401c9aeee$56f954c0$04ebfe40$@f2s.com> <49CD201A.309@gmail.com> <1238404677.3352.4.camel@localhost.localdomain> Message-ID: Hi, On Mon, 30 Mar 2009, Steven Whitehouse wrote: > > We'd need features added to recent 2.6 kernels (like read-only bindmount), > > so the natural path was upgrading GFS1. However, as in the present state > > our cluster is unstable, either we have to find the culprit or go back to > > the proven version (and loosing the required new features). > > Read only bind mounts have not been tested with gfs1 and they might very > well not work correctly, so be careful. Our general plan is to try and > introduce new features into GFS2 and to maintain GFS with its existing > feature set. Thats not to say that there will be no new features in GFS, > but just that we are trying in general to put the new stuff in GFS2. We do not need read-only bind mounts in GFS itself but in local filesystems. > Upgrading from GFS1 to GFS2 will always have to involve a shutdown in > cluster operations since the differing journalling schemes rule out a > node by node in place upgrade I'm afraid, That's fully acceptable. I assume there's no problem in running GFS1 and GFS2 volumes in parallel? 
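For reference on the read-only bind mounts mentioned above: on a local filesystem with a recent enough kernel (the feature landed in mainline around 2.6.26, so 2.6.27.21 qualifies), they are normally set up in two steps, because the ro flag only takes effect on a remount of the bind. Paths here are placeholders:

  mount --bind /export/data /srv/data-ro
  mount -o remount,ro,bind /srv/data-ro

  # the bind should now show up with 'ro' in /proc/mounts
  grep /srv/data-ro /proc/mounts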
Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 49, Hungary From swhiteho at redhat.com Mon Mar 30 09:47:30 2009 From: swhiteho at redhat.com (Steven Whitehouse) Date: Mon, 30 Mar 2009 10:47:30 +0100 Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: References: <096401c9ae82$ac21d650$046582f0$@yarwood@juno.co.uk> <005401c9aeee$56f954c0$04ebfe40$@f2s.com> <49CD201A.309@gmail.com> <1238404677.3352.4.camel@localhost.localdomain> Message-ID: <1238406450.3352.6.camel@localhost.localdomain> Hi, On Mon, 2009-03-30 at 11:20 +0200, Kadlecsik Jozsef wrote: > Hi, > > On Mon, 30 Mar 2009, Steven Whitehouse wrote: > > > > We'd need features added to recent 2.6 kernels (like read-only bindmount), > > > so the natural path was upgrading GFS1. However, as in the present state > > > our cluster is unstable, either we have to find the culprit or go back to > > > the proven version (and loosing the required new features). > > > > Read only bind mounts have not been tested with gfs1 and they might very > > well not work correctly, so be careful. Our general plan is to try and > > introduce new features into GFS2 and to maintain GFS with its existing > > feature set. Thats not to say that there will be no new features in GFS, > > but just that we are trying in general to put the new stuff in GFS2. > > We do not need read-only bind mounts in GFS itself but in local > filesystems. Ok, that should be fine then. > > > Upgrading from GFS1 to GFS2 will always have to involve a shutdown in > > cluster operations since the differing journalling schemes rule out a > > node by node in place upgrade I'm afraid, > > That's fully acceptable. I assume there's no problem in running GFS1 and > GFS2 volumes in parallel? > > Best regards, > Jozsef No, they are designed to be able to run in parallel, so there should be no issues doing that, Steve. From redhat-list at mail.ru Mon Mar 30 09:48:02 2009 From: redhat-list at mail.ru (redhat redhat) Date: Mon, 30 Mar 2009 13:48:02 +0400 Subject: [Linux-cluster] (no subject) Message-ID: Problem Cluster Status Quorum Disk is Offline [root at cgate1 /]# service qdiskd status qdiskd (pid 25892) is running... [root at cgate2 ~]# service qdiskd status qdiskd (pid 29673) is running... # clustat Cluster Status for CG1 @ Mon Mar 30 13:14:17 2009 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ cgate1.ferma.ru 1 Online, Local cgate2.ferma.ru 2 Online /dev/sda1 0 Offline, Quorum Disk # cman_tool status Version: 6.1.0 Config Version: 7 Cluster Name: CG1 Cluster Id: 459 Cluster Member: Yes Cluster Generation: 32 Membership state: Cluster-Member Nodes: 2 Expected votes: 2 Total votes: 2 Quorum: 2 Active subsystems: 9 Flags: Dirty Ports Bound: 0 11 Node name: cgate1.ferma.ru Node ID: 1 Multicast addresses: 239.192.1.204 Node addresses: 172.22.192.131 From redhat-list at mail.ru Mon Mar 30 09:50:59 2009 From: redhat-list at mail.ru (redhat redhat) Date: Mon, 30 Mar 2009 13:50:59 +0400 Subject: [Linux-cluster] Problem - Quorum Disk is Offline Message-ID: Problem Cluster Status Quorum Disk is Offline [root at cgate1 /]# service qdiskd status qdiskd (pid 25892) is running... [root at cgate2 ~]# service qdiskd status qdiskd (pid 29673) is running... 
# clustat Cluster Status for CG1 @ Mon Mar 30 13:14:17 2009 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ cgate1.ferma.ru 1 Online, Local cgate2.ferma.ru 2 Online /dev/sda1 0 Offline, Quorum Disk # cman_tool status Version: 6.1.0 Config Version: 7 Cluster Name: CG1 Cluster Id: 459 Cluster Member: Yes Cluster Generation: 32 Membership state: Cluster-Member Nodes: 2 Expected votes: 2 Total votes: 2 Quorum: 2 Active subsystems: 9 Flags: Dirty Ports Bound: 0 11 Node name: cgate1.ferma.ru Node ID: 1 Multicast addresses: 239.192.1.204 Node addresses: 172.22.192.131 From paul at dugas.cc Mon Mar 30 11:06:08 2009 From: paul at dugas.cc (Paul Dugas) Date: Mon, 30 Mar 2009 11:06:08 +0000 Subject: [Linux-cluster] Cluster Networks Message-ID: <78080460-1238411174-cardhu_decombobulator_blackberry.rim.net-2012931512-@bxe1191.bisx.prod.on.blackberry> I've a few machines sharing a couple GFS/LVM volumes that are physically on an AOE device. Each machine has two network interfaces; LAN and the AOE SAN. I don't have IP addresses on the SAN interfaces so the cluster is communicating via the LAN. Is this ideal or should I configure them to use the SAN interfaces instead? Paul __ Paul Dugas -- 404.932.1355 -- paul at dugas.cc From shang at ubuntu.com Mon Mar 30 12:40:19 2009 From: shang at ubuntu.com (Shang Wu) Date: Mon, 30 Mar 2009 08:40:19 -0400 Subject: [Linux-cluster] Problem - Quorum Disk is Offline In-Reply-To: References: Message-ID: 2009/3/30 redhat redhat : > Problem > Cluster Status Quorum Disk is Offline > > > [root at cgate1 /]# service qdiskd status > qdiskd (pid 25892) is running... > > [root at cgate2 ~]# service qdiskd status > qdiskd (pid 29673) is running... > > # clustat > Cluster Status for CG1 @ Mon Mar 30 13:14:17 2009 > Member Status: Quorate > > Member Name ID Status > ------ ---- ---- ------ > cgate1.ferma.ru 1 Online, Local > cgate2.ferma.ru 2 Online > /dev/sda1 0 Offline, Quorum Disk > > > # cman_tool status > Version: 6.1.0 > Config Version: 7 > Cluster Name: CG1 > Cluster Id: 459 > Cluster Member: Yes > Cluster Generation: 32 > Membership state: Cluster-Member > Nodes: 2 > Expected votes: 2 > Total votes: 2 > Quorum: 2 > Active subsystems: 9 > Flags: Dirty > Ports Bound: 0 11 > Node name: cgate1.ferma.ru > Node ID: 1 > Multicast addresses: 239.192.1.204 > Node addresses: 172.22.192.131 > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > What does it say in /var/log/messages file?
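For comparison, a two-node qdisk setup normally ties a label written by mkqdisk to a <quorumd> block in cluster.conf. A rough sketch follows; the device, label, vote counts and heuristic address are placeholders, and the exact attribute names should be checked against the qdisk(5) man page:

  # initialise the quorum partition once, then confirm both nodes can see the label
  mkqdisk -c /dev/sda1 -l CG1_qdisk
  mkqdisk -L

and in cluster.conf something along these lines:

  <cman expected_votes="3"/>
  <quorumd interval="1" tko="10" votes="1" label="CG1_qdisk">
      <heuristic program="ping -c1 -t1 172.22.192.1" score="1" interval="2"/>
  </quorumd>

With a one-vote quorum disk and two one-vote nodes, expected_votes is usually 3 rather than 2, and two_node is left unset. If clustat keeps reporting the device as Offline even though qdiskd is running, comparing the label or device named in <quorumd> against what mkqdisk -L reports is a reasonable first check.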
-- Shang Wu ---------------------------------------------------------------- Public Key: keyserver.ubuntu.com Key ID: 4B2BCA02 Fingerprint: 4832 D5D0 D124 CE1D FD07 167A 3E93 FF44 4B2B CA02 From redhat-list at mail.ru Mon Mar 30 13:22:24 2009 From: redhat-list at mail.ru (redhat redhat) Date: Mon, 30 Mar 2009 17:22:24 +0400 Subject: [Linux-cluster] Problem - Quorum Disk is Offline Message-ID: Mar 30 17:09:44 cgate1 openais[2985]: [CMAN ] quorum device unregistered Mar 30 17:09:48 cgate1 qdiskd[6476]: Quorum Daemon Initializing -----Original Message----- From: redhat redhat To: Shang Wu Date: Mon, 30 Mar 2009 17:12:08 +0400 Subject: Re[2]: [Linux-cluster] Problem - Quorum Disk is Offline > Mar 30 17:09:44 cgate1 openais[2985]: [CMAN ] quorum device unregistered > Mar 30 17:09:48 cgate1 qdiskd[6476]: Quorum Daemon Initializing > > > -----Original Message----- > From: Shang Wu > To: redhat redhat , linux clustering > Date: Mon, 30 Mar 2009 08:40:19 -0400 > Subject: Re: [Linux-cluster] Problem - Quorum Disk is Offline > > > 2009/3/30 redhat redhat : > > > Problem > > > Cluster Status Quorum Disk is Offline > > > > > > > > > [root at cgate1 /]# service qdiskd status > > > qdiskd (pid 25892) is running... > > > > > > [root at cgate2 ~]# service qdiskd status > > > qdiskd (pid 29673) is running... > > > > > > # clustat > > > Cluster Status for CG1 @ Mon Mar 30 13:14:17 2009 > > > Member Status: Quorate > > > > > > Member Name ID Status > > > ------ ---- ---- ------ > > > cgate1.ferma.ru 1 Online, Local > > > cgate2.ferma.ru 2 Online > > > /dev/sda1 0 Offline, Quorum Disk > > > > > > > > > # cman_tool status > > > Version: 6.1.0 > > > Config Version: 7 > > > Cluster Name: CG1 > > > Cluster Id: 459 > > > Cluster Member: Yes > > > Cluster Generation: 32 > > > Membership state: Cluster-Member > > > Nodes: 2 > > > Expected votes: 2 > > > Total votes: 2 > > > Quorum: 2 > > > Active subsystems: 9 > > > Flags: Dirty > > > Ports Bound: 0 11 > > > Node name: cgate1.ferma.ru > > > Node ID: 1 > > > Multicast addresses: 239.192.1.204 > > > Node addresses: 172.22.192.131 > > > > > > > > > -- > > > Linux-cluster mailing list > > > Linux-cluster at redhat.com > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > What does it say in /var/log/messages file? 
> > > > -- > > Shang Wu > > ---------------------------------------------------------------- > > Public Key: keyserver.ubuntu.com > > Key ID: 4B2BCA02 > > Fingerprint: 4832 D5D0 D124 CE1D FD07 167A 3E93 FF44 4B2B CA02 > > > > From s.wendy.cheng at gmail.com Mon Mar 30 16:00:03 2009 From: s.wendy.cheng at gmail.com (Wendy Cheng) Date: Mon, 30 Mar 2009 11:00:03 -0500 Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: References: <1404804625.1710261238184677530.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> <1a2a6dd60903272036k7bedef6ft718cf74331f562bc@mail.gmail.com> <49CE4B36.2000103@gmail.com> <49D03447.2000901@gmail.com> Message-ID: <49D0EC83.4030202@gmail.com> Kadlecsik Jozsef wrote: > >> You mean the part of the patch >> >> @@ -1503,6 +1503,15 @@ gfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct >> error = gfs_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_ANY, &gh); >> if (!error) { >> generic_fillattr(inode, stat); >> + if (S_ISREG(inode->i_mode) && dentry->d_parent >> + && dentry->d_parent->d_inode) { >> + p_inode = igrab(dentry->d_parent->d_inode); >> + if (p_inode) { >> + pi = get_v2ip(p_inode); >> + pi->i_dir_stats++; >> + iput(p_inode); >> + } >> + } >> gfs_glock_dq_uninit(&gh); >> } >> >> might cause a deadlock: if the parent directory inode is already locked, >> then this part will wait infinitely to get the lock, isn't it? >> >> If I open a directory and then stat a file in it, is that enough to >> trigger the deadlock? >> > > No, that's too simple and should have came out much earlier, the patch is > from Nov 6 2008. Something like creating files in a directory by one > process and statting at the same time by another one, in a loop? > > It would be a shame if GFS(1/2) ends up losing you as a user - not many users can delve into the bits and bytes like you. My suggestion is that you work directly with GFS engineers, particularly the one who submitted this patch. He is bright and hardworking - one of the best among young engineers within Red Hat. This patch is a good "start" to get into the root cause (as gfs readdir is hung on *every* console logs you generated). Maybe a bugzilla would be a good start ? I really can't keep spending time on this. As Monday arrives, I'm always behind few of my tasks.... -- Wendy From jeff.sturm at eprize.com Mon Mar 30 16:43:28 2009 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Mon, 30 Mar 2009 12:43:28 -0400 Subject: [Linux-cluster] Cluster Networks In-Reply-To: <78080460-1238411174-cardhu_decombobulator_blackberry.rim.net-2012931512-@bxe1191.bisx.prod.on.blackberry> References: <78080460-1238411174-cardhu_decombobulator_blackberry.rim.net-2012931512-@bxe1191.bisx.prod.on.blackberry> Message-ID: <64D0546C5EBBD147B75DE133D798665F021BA6A9@hugo.eprize.local> > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Paul Dugas > Sent: Monday, March 30, 2009 7:06 AM > To: Linux-Cluster Mailing List > Subject: [Linux-cluster] Cluster Networks > > I've a few machines sharing a couple GFS/LVM volumes that are > physically on an AOE device. Each machine has two network > interfaces; LAN and the AOE SAN. I don't have IP addresses > on the SAN interfaces so the cluster is communicating via the LAN. > > Is this ideal or should I configure them to use the SAN > interfaces instead? It depends. Is it your wish to maximize throughput or availability? One consideration is MTU. 
Given a standard blocksize of 4k on Linux, AoE initiators benefit from jumbo frames, since a complete block can be delivered in one packet. On the other hand, packets from openais/lock_dlm are generally quite small and do not fragment in a normal MTU. If you are able to run jumbo frames on all your network interfaces, AoE can use any interface and benefit from the extra thoughput. If however your switch ports are not configured for jumbo frames, you may be better off keeping separate interfaces for the two, unless the additional throughput isn't important to you. For maximum uptime, you can multipath AoE over two interfaces, so that if a single interface were to fail, traffic will resume on the other. Multipath isn't available for openais (I believe it is implemented but not supported) but you can run a bonded ethernet interface to achieve similar results. An active/passive bonded pair connected to two separate switches would give you protection from failure of a single switch/cable/iface, which is very nice for a cluster, because you can design the network for no single point of failure (depending also on your power configuration). If you can run both the SAN/LAN on jumbo frames, and multipath AoE, you can get very nice throughput. With the latest AoE driver, an updated e1000 driver, and some network tuning, we can sustain 190MB/s AoE transfers on our test network. Jeff From teigland at redhat.com Mon Mar 30 16:42:16 2009 From: teigland at redhat.com (David Teigland) Date: Mon, 30 Mar 2009 11:42:16 -0500 Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: References: Message-ID: <20090330164216.GA6135@redhat.com> On Fri, Mar 27, 2009 at 06:19:50PM +0100, Kadlecsik Jozsef wrote: > Hi, > > Combing through the log files I found the following: > > Mar 27 13:31:56 lxserv0 fenced[3833]: web1-gfs not a cluster member after 0 sec post_fail_delay > Mar 27 13:31:56 lxserv0 fenced[3833]: fencing node "web1-gfs" > Mar 27 13:31:56 lxserv0 fenced[3833]: can't get node number for node e1??e1?? > Mar 27 13:31:56 lxserv0 fenced[3833]: fence "web1-gfs" success > > The line saying "can't get node number for node e1??e1??" might be > innocent, but looks suspicious. Why fenced could not get the victim name? I've not seen that before, and I can't explain either how cman_get_node() could have failed or why it printed a garbage string. It's a non-essential bit of code, so that error should not be related to your problem. Dave From Paul at dugas.cc Mon Mar 30 17:56:08 2009 From: Paul at dugas.cc (Paul Dugas) Date: Mon, 30 Mar 2009 13:56:08 -0400 Subject: [Linux-cluster] Cluster Networks In-Reply-To: <64D0546C5EBBD147B75DE133D798665F021BA6A9@hugo.eprize.local> References: <78080460-1238411174-cardhu_decombobulator_blackberry.rim.net-2012931512-@bxe1191.bisx.prod.on.blackberry> <64D0546C5EBBD147B75DE133D798665F021BA6A9@hugo.eprize.local> Message-ID: <1238435768.2483.57.camel@ltpad.dugas.lan> On Mon, 2009-03-30 at 12:43 -0400, Jeff Sturm wrote: > > -----Original Message----- > > From: linux-cluster-bounces at redhat.com > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Paul Dugas > > Sent: Monday, March 30, 2009 7:06 AM > > To: Linux-Cluster Mailing List > > Subject: [Linux-cluster] Cluster Networks > > > > I've a few machines sharing a couple GFS/LVM volumes that are > > physically on an AOE device. Each machine has two network > > interfaces; LAN and the AOE SAN. I don't have IP addresses > > on the SAN interfaces so the cluster is communicating via the LAN. 
> > > > Is this ideal or should I configure them to use the SAN > > interfaces instead? > > It depends. Is it your wish to maximize throughput or availability? > > One consideration is MTU. Given a standard blocksize of 4k on Linux, > AoE initiators benefit from jumbo frames, since a complete block can be > delivered in one packet. On the other hand, packets from > openais/lock_dlm are generally quite small and do not fragment in a > normal MTU. > > If you are able to run jumbo frames on all your network interfaces, AoE > can use any interface and benefit from the extra thoughput. If however > your switch ports are not configured for jumbo frames, you may be better > off keeping separate interfaces for the two, unless the additional > throughput isn't important to you. > > For maximum uptime, you can multipath AoE over two interfaces, so that > if a single interface were to fail, traffic will resume on the other. > Multipath isn't available for openais (I believe it is implemented but > not supported) but you can run a bonded ethernet interface to achieve > similar results. An active/passive bonded pair connected to two > separate switches would give you protection from failure of a single > switch/cable/iface, which is very nice for a cluster, because you can > design the network for no single point of failure (depending also on > your power configuration). > > If you can run both the SAN/LAN on jumbo frames, and multipath AoE, you > can get very nice throughput. With the latest AoE driver, an updated > e1000 driver, and some network tuning, we can sustain 190MB/s AoE > transfers on our test network. Availability is not my concern here but I appreciate the info. I maintain a physically separate SAN (separate switch) for the AoE traffic with jumbo frames enabled and those interfaces are already doubled up supporting AoE multipath. That network is all Etherent, no IP. My question is more aimed at cluster stability and consistency. Members are monitoring each other via IP traffic over their LAN interfaces and I'm wondering if that is the correct way for the cluster to operate. I'm not familiar with the internals of the clustering software at all but I had a thought that since the cluster is solely in place to utilize shared GFS volumes, is it best that they monitor each other via the same network they access the volumes over? Or, is it correct that they utilize the same LAN network that clients of the cluster are on? It would be simple to setup an IP subnet for the SAN and adjust the cluster configs to use those names/addresses instead. I'm just wondering if that's the "correct" way to do this. Thanks again, Paul -- Paul Dugas - paul at dugas.cc - 404.932.1355 -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 197 bytes Desc: This is a digitally signed message part URL: From teigland at redhat.com Mon Mar 30 18:07:40 2009 From: teigland at redhat.com (David Teigland) Date: Mon, 30 Mar 2009 13:07:40 -0500 Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: References: Message-ID: <20090330180739.GB6135@redhat.com> On Thu, Mar 26, 2009 at 11:47:00PM +0100, Kadlecsik Jozsef wrote: > Hi, > > Freshly built cluster-2.03.11 reproducibly freezes as mailman started. 
> The versions are: > > linux-2.6.27.21 > cluster-2.03.11 > openais from svn, subrev 1152 version 0.80 So, in summary: - nodes 1-5 are correctly forming a cluster, and appear to be stable - nodes 1-5 all correctly mount the gfs file system - node5 runs: init.d/mailman start - node5 "freezes completely" - node5 is fenced by another node, e.g. node4 - sometimes, node4 then freezes completely You're using STABLE2 code, which is equivalent to RHEL5 code *except* for the gfs-kernel patches that are necessary to make gfs run on recent kernels. The RHEL5 code is thoroughly tested, but the STABLE2 code is not, so any differences between them (i.e. the gfs-kernel patches for recent kernels) are the most likely causes for regression bugs. It's always possible that a patch like the one in bz 466645 could be responsible, but it's less likely since it does go through a QE process unlike the patches for kernel updates. Hopefully, some gfs developers can look at the backtraces (which as Wendy points out do look suspicious) and try to reproduce this problem with recent kernels. Aside from gfs, the fact that you're running AoE over the same network at openais does raise some flags. We've seen problems with openais in the past when block i/o is sent over the same network causing load problems. It seems unlikely to be your problem, though, since it works fine with the previous version, and the freezing symptoms aren't what we'd expect to see from openais trouble. Dave From jeff.sturm at eprize.com Mon Mar 30 18:23:12 2009 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Mon, 30 Mar 2009 14:23:12 -0400 Subject: [Linux-cluster] Cluster Networks In-Reply-To: <1238435768.2483.57.camel@ltpad.dugas.lan> References: <78080460-1238411174-cardhu_decombobulator_blackberry.rim.net-2012931512-@bxe1191.bisx.prod.on.blackberry> <64D0546C5EBBD147B75DE133D798665F021BA6A9@hugo.eprize.local> <1238435768.2483.57.camel@ltpad.dugas.lan> Message-ID: <64D0546C5EBBD147B75DE133D798665F021BA6AD@hugo.eprize.local> > -----Original Message----- > From: Paul Dugas [mailto:Paul at dugas.cc] > Sent: Monday, March 30, 2009 1:56 PM > To: Jeff Sturm > Cc: linux clustering > Subject: RE: [Linux-cluster] Cluster Networks > > ... > > It would be simple to setup an IP subnet for the SAN and > adjust the cluster configs to use those names/addresses > instead. I'm just wondering if that's the "correct" way to do this. Then the short answer is "no". The shared block storage required for a typical GFS setup doesn't need to share any physical interface with the cluster software whatsoever. (If you had a FC SAN in place instead of AoE, indeed there would be no ethernet on the SAN link to share.) 
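To illustrate the active/passive bonding and jumbo-frame setup Jeff describes, a RHEL 5 style sketch could look roughly like this. Interface names, addresses and the 9000-byte MTU are placeholders; on older initscripts the bonding options go into /etc/modprobe.conf (alias bond0 bonding plus an options line) instead of BONDING_OPTS:

  # /etc/sysconfig/network-scripts/ifcfg-bond0
  DEVICE=bond0
  BOOTPROTO=none
  ONBOOT=yes
  IPADDR=192.168.50.11
  NETMASK=255.255.255.0
  # jumbo frames, only if every switch port on the path supports them
  MTU=9000
  BONDING_OPTS="mode=active-backup miimon=100"

  # /etc/sysconfig/network-scripts/ifcfg-eth2 (and the same for eth3)
  DEVICE=eth2
  BOOTPROTO=none
  ONBOOT=yes
  MASTER=bond0
  SLAVE=yes

mode=active-backup (mode 1) gives the failover behaviour described above without requiring any special switch configuration.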
Jeff From adas at redhat.com Mon Mar 30 18:45:08 2009 From: adas at redhat.com (Abhijith Das) Date: Mon, 30 Mar 2009 13:45:08 -0500 Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: <49D0EC83.4030202@gmail.com> References: <1404804625.1710261238184677530.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> <1a2a6dd60903272036k7bedef6ft718cf74331f562bc@mail.gmail.com> <49CE4B36.2000103@gmail.com> <49D03447.2000901@gmail.com> <49D0EC83.4030202@gmail.com> Message-ID: <49D11334.5030406@redhat.com> Wendy Cheng wrote: > Kadlecsik Jozsef wrote: > >>> You mean the part of the patch >>> >>> @@ -1503,6 +1503,15 @@ gfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct >>> error = gfs_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_ANY, &gh); >>> if (!error) { >>> generic_fillattr(inode, stat); >>> + if (S_ISREG(inode->i_mode) && dentry->d_parent >>> + && dentry->d_parent->d_inode) { >>> + p_inode = igrab(dentry->d_parent->d_inode); >>> + if (p_inode) { >>> + pi = get_v2ip(p_inode); >>> + pi->i_dir_stats++; >>> + iput(p_inode); >>> + } >>> + } >>> gfs_glock_dq_uninit(&gh); >>> } >>> >>> might cause a deadlock: if the parent directory inode is already locked, >>> then this part will wait infinitely to get the lock, isn't it? >>> >>> If I open a directory and then stat a file in it, is that enough to >>> trigger the deadlock? >>> >>> >> No, that's too simple and should have came out much earlier, the patch is >> from Nov 6 2008. Something like creating files in a directory by one >> process and statting at the same time by another one, in a loop? >> >> >> > > It would be a shame if GFS(1/2) ends up losing you as a user - not many > users can delve into the bits and bytes like you. > > My suggestion is that you work directly with GFS engineers, particularly > the one who submitted this patch. He is bright and hardworking - one of > the best among young engineers within Red Hat. This patch is a good > "start" to get into the root cause (as gfs readdir is hung on *every* > console logs you generated). Maybe a bugzilla would be a good start ? Jozsef, Could you remove the patch associated with bz 466645 and see if you can hit the hang again? I've looked at the patch and I can't spot anything obvious. If this patch is causing your problems, I'll work on reproducing the problem on my setup here and try to fix it. Thanks --Abhi From mockey.chen at nsn.com Tue Mar 31 04:58:46 2009 From: mockey.chen at nsn.com (Chen Ming) Date: Tue, 31 Mar 2009 12:58:46 +0800 Subject: [Linux-cluster] Can same cluster name in same subnet? Message-ID: <49D1A306.5080804@nsn.com> Hi, I setup two cluster in same subnet, using the same name, it seems the later cluster can not startup. I try to start cman but failed. My environment is RHEL 5.3. My question: Is it possible to use the same cluster name in same subnet? Best Regards. Chen Ming From vu at sivell.com Tue Mar 31 06:10:20 2009 From: vu at sivell.com (Vu Pham) Date: Tue, 31 Mar 2009 01:10:20 -0500 Subject: [Linux-cluster] Can same cluster name in same subnet? In-Reply-To: <49D1A306.5080804@nsn.com> References: <49D1A306.5080804@nsn.com> Message-ID: <49D1B3CC.8010800@sivell.com> Chen Ming wrote: > Hi, > > I setup two cluster in same subnet, using the same name, it seems the > later cluster can not startup. I try to start cman but failed. My > environment is RHEL 5.3. > > My question: Is it possible to use the same cluster name in same subnet? 
> From the man page of cman ----------------- Cluster ID The cluster ID number is used to isolate clusters in the same subnet. Usually it is generated from a hash of the cluster name, but it can be overridden here if you feel the need. Sometimes cluster names can hash to the same ID. ----------------- So I think you either have to use different names or to explicitly specify different cluster_id. But why do you have to have the same cluster name in the first place ? Vu From sdake at redhat.com Tue Mar 31 06:12:55 2009 From: sdake at redhat.com (Steven Dake) Date: Mon, 30 Mar 2009 23:12:55 -0700 Subject: [Linux-cluster] Can same cluster name in same subnet? In-Reply-To: <49D1B3CC.8010800@sivell.com> References: <49D1A306.5080804@nsn.com> <49D1B3CC.8010800@sivell.com> Message-ID: <1238479975.29797.3.camel@sdake-laptop> You can have the same cluster name and override the multicast address as well. The multicast address is the unique identifier for the cluster, not the cluster name. The cluster name hashes into a multicast address in a deterministic fashion on all nodes with the same cluster id. This means they will conflict if they have a differing configuration. Regards -steve On Tue, 2009-03-31 at 01:10 -0500, Vu Pham wrote: > Chen Ming wrote: > > Hi, > > > > I setup two cluster in same subnet, using the same name, it seems the > > later cluster can not startup. I try to start cman but failed. My > > environment is RHEL 5.3. > > > > My question: Is it possible to use the same cluster name in same subnet? > > > > From the man page of cman > ----------------- > Cluster ID > The cluster ID number is used to isolate clusters in the same > subnet. > Usually it is generated from a hash of the cluster name, but it > can be > overridden here if you feel the need. Sometimes cluster names > can hash > to the same ID. > > > > ----------------- > > So I think you either have to use different names or to explicitly > specify different cluster_id. > > But why do you have to have the same cluster name in the first place ? > > Vu > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From vu at sivell.com Tue Mar 31 06:18:45 2009 From: vu at sivell.com (Vu Pham) Date: Tue, 31 Mar 2009 01:18:45 -0500 Subject: [Linux-cluster] Can same cluster name in same subnet? In-Reply-To: <1238479975.29797.3.camel@sdake-laptop> References: <49D1A306.5080804@nsn.com> <49D1B3CC.8010800@sivell.com> <1238479975.29797.3.camel@sdake-laptop> Message-ID: <49D1B5C5.5010807@sivell.com> Steven Dake wrote: > You can have the same cluster name and override the multicast address as > well. The multicast address is the unique identifier for the cluster, > not the cluster name. The cluster name hashes into a multicast address > in a deterministic fashion on all nodes with the same cluster id. This > means they will conflict if they have a differing configuration. Steve, thanks for this information. Interesting !!! Vu > > Regards > -steve > > On Tue, 2009-03-31 at 01:10 -0500, Vu Pham wrote: >> Chen Ming wrote: >>> Hi, >>> >>> I setup two cluster in same subnet, using the same name, it seems the >>> later cluster can not startup. I try to start cman but failed. My >>> environment is RHEL 5.3. >>> >>> My question: Is it possible to use the same cluster name in same subnet? >>> >> From the man page of cman >> ----------------- >> Cluster ID >> The cluster ID number is used to isolate clusters in the same >> subnet. 
>> Usually it is generated from a hash of the cluster name, but it >> can be >> overridden here if you feel the need. Sometimes cluster names >> can hash >> to the same ID. >> >> >> >> ----------------- >> >> So I think you either have to use different names or to explicitly >> specify different cluster_id. >> >> But why do you have to have the same cluster name in the first place ? >> >> Vu >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From brettcave at gmail.com Tue Mar 31 07:01:15 2009 From: brettcave at gmail.com (Brett Cave) Date: Tue, 31 Mar 2009 09:01:15 +0200 Subject: [Linux-cluster] explanation of how disaallowed nodes works Message-ID: hi, On a 6 node cluster, 2 nodes (1 & 6) were fenced. On coming back up, the 2 nodes were not able to start the cman service. All the other nodes have activity blocked. Disallowed nodes are (from cman_tool status) node2: 3,4,5 node3: 2,4,5 node4: 2,3,5 node5: 2,3,4 node1 & node6 - cman not running. Am using qdisk, and all running nodes have the disallowed list flagged as "d" - disallowed. Each node then also has: X (not a cluster member) for qdisk and the 2 fenced nodes that cman will not start on. d (on the 3 running nodes other than current) M (on the self-node - i.e. if run on node2, then node2 = M) This is what I get in logs when I try start cman on 1 of the X nodes... openais[10465]: CMAN: Joined a cluster with disallowed nodes. must die I cant get the nodes to restart cman - "service gfs stop" to unmount gfs mounts hangs... the following process is not able to complete: /sbin/umount.gfs /my/mountpoint1 Is there a way to get the cluster to recover from this? Going to be fencing all the nodes now to get the system up. From ccaulfie at redhat.com Tue Mar 31 07:29:45 2009 From: ccaulfie at redhat.com (Chrissie Caulfield) Date: Tue, 31 Mar 2009 08:29:45 +0100 Subject: [Linux-cluster] Can same cluster name in same subnet? In-Reply-To: <49D1B3CC.8010800@sivell.com> References: <49D1A306.5080804@nsn.com> <49D1B3CC.8010800@sivell.com> Message-ID: <49D1C669.4030906@redhat.com> Vu Pham wrote: > > Chen Ming wrote: >> Hi, >> >> I setup two cluster in same subnet, using the same name, it seems the >> later cluster can not startup. I try to start cman but failed. My >> environment is RHEL 5.3. >> >> My question: Is it possible to use the same cluster name in same subnet? >> > > From the man page of cman > ----------------- > Cluster ID > The cluster ID number is used to isolate clusters in the same > subnet. > Usually it is generated from a hash of the cluster name, but it > can be > overridden here if you feel the need. Sometimes cluster names > can hash > to the same ID. > > > > ----------------- > > So I think you either have to use different names or to explicitly > specify different cluster_id. > > But why do you have to have the same cluster name in the first place ? Bear in mind that using the same cluster name is highly dangerous if you are using GFS with storage that is shared between the clusters. GFS checks the cluster name before mounting and if it matches the current cluster assumes it's OK to carry on. If two clusters have the same name then that means two independent clusters could mount the same GFS volume. The result of that would be a totally corrupted filesystem. 
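To make Steve's and Vu's points concrete: two clusters that have to keep the same name would each need their identity spelled out in cluster.conf, along the lines of the sketch below. The id values and multicast groups are made up, and the attribute names should be double-checked against the cman man page; and, as Chrissie warns, none of this protects a GFS volume that both clusters can reach, since the cluster name recorded in its lock table would still match either of them:

  <!-- cluster A -->
  <cluster name="mycluster" config_version="1">
    <cman cluster_id="101">
      <multicast addr="239.192.100.1"/>
    </cman>
    <!-- clusternodes, fencedevices, rm ... -->
  </cluster>

  <!-- cluster B: same name, explicitly different id and multicast group -->
  <cluster name="mycluster" config_version="1">
    <cman cluster_id="102">
      <multicast addr="239.192.100.2"/>
    </cman>
    <!-- clusternodes, fencedevices, rm ... -->
  </cluster>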
Chrissie From ccaulfie at redhat.com Tue Mar 31 07:31:18 2009 From: ccaulfie at redhat.com (Chrissie Caulfield) Date: Tue, 31 Mar 2009 08:31:18 +0100 Subject: [Linux-cluster] explanation of how disaallowed nodes works In-Reply-To: References: Message-ID: <49D1C6C6.4080100@redhat.com> Brett Cave wrote: > hi, > > On a 6 node cluster, 2 nodes (1 & 6) were fenced. On coming back up, > the 2 nodes were not able to start the cman service. > > All the other nodes have activity blocked. Disallowed nodes are (from > cman_tool status) > node2: 3,4,5 > node3: 2,4,5 > node4: 2,3,5 > node5: 2,3,4 > > node1 & node6 - cman not running. > > Am using qdisk, and all running nodes have the disallowed list flagged > as "d" - disallowed. > Each node then also has: > X (not a cluster member) for qdisk and the 2 fenced nodes that cman > will not start on. > d (on the 3 running nodes other than current) > M (on the self-node - i.e. if run on node2, then node2 = M) > > > This is what I get in logs when I try start cman on 1 of the X nodes... > openais[10465]: CMAN: Joined a cluster with disallowed nodes. must die > > > I cant get the nodes to restart cman - "service gfs stop" to unmount > gfs mounts hangs... the following process is not able to complete: > /sbin/umount.gfs /my/mountpoint1 > > Is there a way to get the cluster to recover from this? Going to be > fencing all the nodes now to get the system up. The cman_tool man page has some detail on disallowed mode. But also check the version. cman in RHEL5.3 has a bug that can cause this to happen. I believe a hot fix is in the works somewhere... Chrissie From kadlec at mail.kfki.hu Tue Mar 31 09:18:51 2009 From: kadlec at mail.kfki.hu (Kadlecsik Jozsef) Date: Tue, 31 Mar 2009 11:18:51 +0200 (CEST) Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: <20090330164216.GA6135@redhat.com> References: <20090330164216.GA6135@redhat.com> Message-ID: On Mon, 30 Mar 2009, David Teigland wrote: > On Fri, Mar 27, 2009 at 06:19:50PM +0100, Kadlecsik Jozsef wrote: > > > > Combing through the log files I found the following: > > > > Mar 27 13:31:56 lxserv0 fenced[3833]: web1-gfs not a cluster member after 0 sec post_fail_delay > > Mar 27 13:31:56 lxserv0 fenced[3833]: fencing node "web1-gfs" > > Mar 27 13:31:56 lxserv0 fenced[3833]: can't get node number for node e1??e1?? > > Mar 27 13:31:56 lxserv0 fenced[3833]: fence "web1-gfs" success > > > > The line saying "can't get node number for node e1??e1??" might be > > innocent, but looks suspicious. Why fenced could not get the victim name? > > I've not seen that before, and I can't explain either how cman_get_node() > could have failed or why it printed a garbage string. It's a non-essential > bit of code, so that error should not be related to your problem. Yes, it is surely not related to the freeze, but disturbing. Hm, in the function dispatch_fence_agent there's an ordering issue, I believe. The variable victim_nodename is freed but update_cman is called with variable victim pointing to the just freed victim_nodename. Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 
49, Hungary From kadlec at mail.kfki.hu Tue Mar 31 09:50:45 2009 From: kadlec at mail.kfki.hu (Kadlecsik Jozsef) Date: Tue, 31 Mar 2009 11:50:45 +0200 (CEST) Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: <20090330180739.GB6135@redhat.com> References: <20090330180739.GB6135@redhat.com> Message-ID: On Mon, 30 Mar 2009, David Teigland wrote: > On Thu, Mar 26, 2009 at 11:47:00PM +0100, Kadlecsik Jozsef wrote: > > > > Freshly built cluster-2.03.11 reproducibly freezes as mailman started. > > The versions are: > > > > linux-2.6.27.21 > > cluster-2.03.11 > > openais from svn, subrev 1152 version 0.80 > > So, in summary: > - nodes 1-5 are correctly forming a cluster, and appear to be stable > - nodes 1-5 all correctly mount the gfs file system > - node5 runs: init.d/mailman start > - node5 "freezes completely" > - node5 is fenced by another node, e.g. node4 > - sometimes, node4 then freezes completely Yes, exactly. The freeze can reliably be triggered by starting mailman, but it can occur (and did) otherwise as well. The fact that node4 sometimes freezes too does not related to the fact that it fenced off node5. > You're using STABLE2 code, which is equivalent to RHEL5 code *except* > for the gfs-kernel patches that are necessary to make gfs run on recent > kernels. The RHEL5 code is thoroughly tested, but the STABLE2 code is > not, so any differences between them (i.e. the gfs-kernel patches for > recent kernels) are the most likely causes for regression bugs. That's bad because then one cannot check it by just removing patches - the kernel must be changed as well. > It's always possible that a patch like the one in bz 466645 could be > responsible, but it's less likely since it does go through a QE process > unlike the patches for kernel updates. I'll try to find a reliable way to crash the kernel without mailman. That'd make easier the bug-hunting. > Aside from gfs, the fact that you're running AoE over the same network > at openais does raise some flags. We've seen problems with openais in > the past when block i/o is sent over the same network causing load > problems. It seems unlikely to be your problem, though, since it works > fine with the previous version, and the freezing symptoms aren't what > we'd expect to see from openais trouble. AoE and openais seem to work side-by-side just fine. I can imagine that iSCSI and openais have got more trouble because iSCSI is much more heavy weighted than AoE. We pondered a lot over the setup, but decided to go with it and so far it resulted no problem. (Not absolutely true, AoE ethernet interface coming up with speed 10Mbps(!) instead of 10000 can practically kill AoE ;-). Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 
49, Hungary From kadlec at mail.kfki.hu Tue Mar 31 09:54:30 2009 From: kadlec at mail.kfki.hu (Kadlecsik Jozsef) Date: Tue, 31 Mar 2009 11:54:30 +0200 (CEST) Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: <49D11334.5030406@redhat.com> References: <1404804625.1710261238184677530.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> <1a2a6dd60903272036k7bedef6ft718cf74331f562bc@mail.gmail.com> <49CE4B36.2000103@gmail.com> <49D03447.2000901@gmail.com> <49D0EC83.4030202@gmail.com> <49D11334.5030406@redhat.com> Message-ID: Hi, On Mon, 30 Mar 2009, Abhijith Das wrote: > Could you remove the patch associated with bz 466645 and see if you can > hit the hang again? I've looked at the patch and I can't spot anything > obvious. If this patch is causing your problems, I'll work on > reproducing the problem on my setup here and try to fix it. I'll restore the kernel on a not so critical node and will try to find out how to trigger the bug without mailman. If that succeeds then I'll remove the patch in question and re-run the test. It'll need a few days, surely, but I'll report the results. Best regards, Jozsef -- E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt Address: KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 49, Hungary From fernando at lozano.eti.br Tue Mar 31 14:16:40 2009 From: fernando at lozano.eti.br (Fernando Lozano) Date: Tue, 31 Mar 2009 11:16:40 -0300 Subject: [Linux-cluster] rhcs x iptables In-Reply-To: <49cbd269.a2.1b3e.193823380@lozano.eti.br> References: <49cbd269.a2.1b3e.193823380@lozano.eti.br> Message-ID: <49D225C8.90203@lozano.eti.br> Hi, Four days and no replies... maybe you folks don't like me as the list has a healthy trafic on other topics ;-) Is there anything with my setup that shouldn't work? The problem is not with VMs because I tried the same configs with two real Dell servers and got the same problems. My iptables rules follow what's in RHCS manuals and wiki, and I found nothing new with netstat -a. Even them rgmanager only works correctly with iptables turned off (that is, iptables -F). If I start iptables (service iptables start) and then try to start cman and rgmanager, it won't work to flush iptables rules, I am forced to power off because rgmanager won't work and won't stop. My setup is simple: no clvm, no gfs, no gnbd. Just rgmanager and an http service configured as a script and an ip resource. But with iptables on, rgmanager won't relocate or failover the http service. More strange, system-config-cluster shows the service status only on the first node, on the second one it shows an emply service list. What can I do to debug the problem, as my /var/log/messages don't show any error messages, just what apears to be a regular two-node cluster startup? []s, Fernando Lozano > Hi there, > > I have a Fedora 10 system with two KVM virtual machines, both running RHEL 5.2 and RHCS. The intent > is to prototype a cluster configuration for a customer. > > The problem is, everything is fine unless I start iptables on the VMs. But it's unacceptable to run > the cluster without am OS-level firewall. The ports list on rhcs manuals, on the cluster project > wiki, and what I observe using netstat do not agree. None of them talks about port 5149 which I > observe being opened by aisexec (cman). And I don't see any use of ports 41966 through 41968 which > are supposed to be opened my rgmanager or 5404 by cman. 
> > But even after I changed my iptables config to open all ports, I still canot relocate or failover > services between nodes. > > I configured apache as a script service to play with cluster administration. My vms are on the > default KVM network, 192.168.122./24. > > It's very strange system-config-cluster on node 1 shows both nodes (cs1 and cs2) joined the cluster > and starts my teste-httpd service, but node 2 doesn't show the status of any cluster service (on > system-config-cluster). > > If I try to use clusvnadm to relocate the service from cs1 to cs2, it hangs. And I can't stop > rgmanager with iptables enabled. Flushing iptables doesn't help when cman and rgmanager were started > with iptables on. > > Attached are my cluster.conf, /etc/sysconfig/iptables and netstat -anp > > > []s, Fernando Lozano > > > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From jumanjiman at gmail.com Tue Mar 31 14:21:00 2009 From: jumanjiman at gmail.com (jumanjiman at gmail.com) Date: Tue, 31 Mar 2009 14:21:00 +0000 Subject: [Linux-cluster] rhcs x iptables In-Reply-To: <49D225C8.90203@lozano.eti.br> References: <49cbd269.a2.1b3e.193823380@lozano.eti.br><49D225C8.90203@lozano.eti.br> Message-ID: <1304170106-1238509264-cardhu_decombobulator_blackberry.rim.net-280687792-@bxe1288.bisx.prod.on.blackberry> Add some LOG rules to your netfilter config. Use wireshark. Between those two you will find the issue. -paul Sent via BlackBerry by AT&T -----Original Message----- From: Fernando Lozano Date: Tue, 31 Mar 2009 11:16:40 To: linux clustering Subject: Re: [Linux-cluster] rhcs x iptables Hi, Four days and no replies... maybe you folks don't like me as the list has a healthy trafic on other topics ;-) Is there anything with my setup that shouldn't work? The problem is not with VMs because I tried the same configs with two real Dell servers and got the same problems. My iptables rules follow what's in RHCS manuals and wiki, and I found nothing new with netstat -a. Even them rgmanager only works correctly with iptables turned off (that is, iptables -F). If I start iptables (service iptables start) and then try to start cman and rgmanager, it won't work to flush iptables rules, I am forced to power off because rgmanager won't work and won't stop. My setup is simple: no clvm, no gfs, no gnbd. Just rgmanager and an http service configured as a script and an ip resource. But with iptables on, rgmanager won't relocate or failover the http service. More strange, system-config-cluster shows the service status only on the first node, on the second one it shows an emply service list. What can I do to debug the problem, as my /var/log/messages don't show any error messages, just what apears to be a regular two-node cluster startup? []s, Fernando Lozano > Hi there, > > I have a Fedora 10 system with two KVM virtual machines, both running RHEL 5.2 and RHCS. The intent > is to prototype a cluster configuration for a customer. > > The problem is, everything is fine unless I start iptables on the VMs. But it's unacceptable to run > the cluster without am OS-level firewall. The ports list on rhcs manuals, on the cluster project > wiki, and what I observe using netstat do not agree. None of them talks about port 5149 which I > observe being opened by aisexec (cman). And I don't see any use of ports 41966 through 41968 which > are supposed to be opened my rgmanager or 5404 by cman. 
> > But even after I changed my iptables config to open all ports, I still canot relocate or failover > services between nodes. > > I configured apache as a script service to play with cluster administration. My vms are on the > default KVM network, 192.168.122./24. > > It's very strange system-config-cluster on node 1 shows both nodes (cs1 and cs2) joined the cluster > and starts my teste-httpd service, but node 2 doesn't show the status of any cluster service (on > system-config-cluster). > > If I try to use clusvnadm to relocate the service from cs1 to cs2, it hangs. And I can't stop > rgmanager with iptables enabled. Flushing iptables doesn't help when cman and rgmanager were started > with iptables on. > > Attached are my cluster.conf, /etc/sysconfig/iptables and netstat -anp > > > []s, Fernando Lozano > > > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From vu at sivell.com Tue Mar 31 15:28:23 2009 From: vu at sivell.com (vu pham) Date: Tue, 31 Mar 2009 09:28:23 -0600 Subject: [Linux-cluster] rhcs x iptables In-Reply-To: <49D225C8.90203@lozano.eti.br> References: <49cbd269.a2.1b3e.193823380@lozano.eti.br> <49D225C8.90203@lozano.eti.br> Message-ID: <49D23697.50609@sivell.com> Fernando Lozano wrote: > Hi, > > Four days and no replies... maybe you folks don't like me as the list > has a healthy trafic on other topics ;-) > > Is there anything with my setup that shouldn't work? The problem is not > with VMs because I tried the same configs with two real Dell servers and > got the same problems. My iptables rules follow what's in RHCS manuals > and wiki, and I found nothing new with netstat -a. > > Even them rgmanager only works correctly with iptables turned off (that > is, iptables -F). If I start iptables (service iptables start) and then > try to start cman and rgmanager, it won't work to flush iptables rules, > I am forced to power off because rgmanager won't work and won't stop. > > My setup is simple: no clvm, no gfs, no gnbd. Just rgmanager and an http > service configured as a script and an ip resource. But with iptables on, > rgmanager won't relocate or failover the http service. More strange, > system-config-cluster shows the service status only on the first node, > on the second one it shows an emply service list. > > What can I do to debug the problem, as my /var/log/messages don't show > any error messages, just what apears to be a regular two-node cluster > startup? Right before the last iptables command which usually blocks all other connections, add a LOG command to log all denied connections. Clustering uses many ports and multicast. One time I had a fencing problem using virtual fence on Xen, it turned out the multicast was blocked on then Xen host Dom0. > > > []s, Fernando Lozano > >> Hi there, >> >> I have a Fedora 10 system with two KVM virtual machines, both running RHEL 5.2 and RHCS. The intent >> is to prototype a cluster configuration for a customer. >> >> The problem is, everything is fine unless I start iptables on the VMs. But it's unacceptable to run >> the cluster without am OS-level firewall. The ports list on rhcs manuals, on the cluster project >> wiki, and what I observe using netstat do not agree. None of them talks about port 5149 which I >> observe being opened by aisexec (cman). 
And I don't see any use of ports 41966 through 41968 which >> are supposed to be opened my rgmanager or 5404 by cman. >> >> But even after I changed my iptables config to open all ports, I still canot relocate or failover >> services between nodes. >> >> I configured apache as a script service to play with cluster administration. My vms are on the >> default KVM network, 192.168.122./24. >> >> It's very strange system-config-cluster on node 1 shows both nodes (cs1 and cs2) joined the cluster >> and starts my teste-httpd service, but node 2 doesn't show the status of any cluster service (on >> system-config-cluster). >> >> If I try to use clusvnadm to relocate the service from cs1 to cs2, it hangs. And I can't stop >> rgmanager with iptables enabled. Flushing iptables doesn't help when cman and rgmanager were started >> with iptables on. >> >> Attached are my cluster.conf, /etc/sysconfig/iptables and netstat -anp >> >> >> []s, Fernando Lozano >> >> From teigland at redhat.com Tue Mar 31 16:10:07 2009 From: teigland at redhat.com (David Teigland) Date: Tue, 31 Mar 2009 11:10:07 -0500 Subject: [Linux-cluster] Freeze with cluster-2.03.11 In-Reply-To: References: <20090330164216.GA6135@redhat.com> Message-ID: <20090331161007.GA32540@redhat.com> On Tue, Mar 31, 2009 at 11:18:51AM +0200, Kadlecsik Jozsef wrote: > On Mon, 30 Mar 2009, David Teigland wrote: > > > On Fri, Mar 27, 2009 at 06:19:50PM +0100, Kadlecsik Jozsef wrote: > > > > > > Combing through the log files I found the following: > > > > > > Mar 27 13:31:56 lxserv0 fenced[3833]: web1-gfs not a cluster member after 0 sec post_fail_delay > > > Mar 27 13:31:56 lxserv0 fenced[3833]: fencing node "web1-gfs" > > > Mar 27 13:31:56 lxserv0 fenced[3833]: can't get node number for node e1??e1?? > > > Mar 27 13:31:56 lxserv0 fenced[3833]: fence "web1-gfs" success > > > > > > The line saying "can't get node number for node e1??e1??" might be > > > innocent, but looks suspicious. Why fenced could not get the victim name? > > > > I've not seen that before, and I can't explain either how cman_get_node() > > could have failed or why it printed a garbage string. It's a non-essential > > bit of code, so that error should not be related to your problem. > > Yes, it is surely not related to the freeze, but disturbing. > > Hm, in the function dispatch_fence_agent there's an ordering issue, I > believe. The variable victim_nodename is freed but update_cman is called > with variable victim pointing to the just freed victim_nodename. Ah, you're exactly right, thanks for finding that. This bug was fixed in the STABLE3 branch, and I've just pushed a fix for the next 2.03 release. This bug will cause secondary fence methods to fail, so it's more serious than the garbage string. (Strangly, the victim_nodename code doesn't exist at all in the RHEL5 branch, which is why we didn't catch this. I'm not sure how RHEL5/STABLE2 got out of sync there, that's not supposed to happen.) Dave
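To spell out the ordering problem Jozsef spotted and Dave confirms above: the shape is the classic use-after-free sketched below. The helpers are stand-ins rather than the real fenced code, and the sketch only mirrors the reported pattern (the override name is freed, then the now-dangling pointer is handed to update_cman), which is also the kind of bug that can print a garbage node name like the one seen in the log:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* Stand-ins for the fenced helpers discussed in the thread. */
  static void run_fence_agent(const char *victim)
  {
      printf("fence \"%s\"\n", victim);
  }

  static void update_cman(const char *victim)
  {
      printf("telling cman that %s was fenced\n", victim);
  }

  /* Shape of the reported bug: victim may point into victim_nodename,
   * which is freed before update_cman() reads it. */
  static void dispatch_buggy(const char *node)
  {
      char *victim_nodename = strdup(node);  /* e.g. an override name   */
      const char *victim = victim_nodename ? victim_nodename : node;

      run_fence_agent(victim);
      free(victim_nodename);                 /* freed too early ...     */
      update_cman(victim);                   /* ... and read here       */
  }

  /* Fixed ordering: the last user of the string runs before the free(). */
  static void dispatch_fixed(const char *node)
  {
      char *victim_nodename = strdup(node);
      const char *victim = victim_nodename ? victim_nodename : node;

      run_fence_agent(victim);
      update_cman(victim);
      free(victim_nodename);
  }

  int main(void)
  {
      dispatch_fixed("web1-gfs");
      /* dispatch_buggy("web1-gfs") would read freed memory */
      return 0;
  }

Presumably the fix Dave pushed amounts to the second ordering, or to copying the string before it is released; either way the point is simply that the last use has to happen before the free.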