From tserong at suse.com  Mon Dec  1 10:14:59 2014
From: tserong at suse.com (Tim Serong)
Date: Mon, 01 Dec 2014 21:14:59 +1100
Subject: [Linux-cluster] [ha-wg] [Pacemaker] [RFC] Organizing HA Summit
	2015
In-Reply-To: <54734B52.50708@alteeve.ca>
References: <540D853F.3090109@redhat.com>	<20141124143957.GU2508@suse.de>	<547346A9.6010901@redhat.com>	<20141124151235.GX2508@suse.de>
	<54734B52.50708@alteeve.ca>
Message-ID: <547C3FA3.2070109@suse.com>

On 11/25/2014 02:14 AM, Digimer wrote:
> On 24/11/14 10:12 AM, Lars Marowsky-Bree wrote:
>> Beijing, the US, Tasmania (OK, one crazy guy), various countries in
> 
> Oh, bring him! crazy++

What, you want to bring the guy who's boldly maintaining the outpost on
the southern frontier? ;)

*cough*

Barring a miracle or a sudden huge advance in matter transporter
technology I'm rather unlikely to make it, I'm afraid.  But I'll add my
voice to what Lars said in another email - go all physical (with good
minutes/notes/etherpads for others to review - which I assume is what's
going to happen this time), or all virtual.  Mixing the two is
exceedingly difficult to do well, IMO.

Regards,

Tim
-- 
Tim Serong
Senior Clustering Engineer
SUSE
tserong at suse.com


From nagemnna at gmail.com  Mon Dec  1 14:16:11 2014
From: nagemnna at gmail.com (Megan .)
Date: Mon, 1 Dec 2014 09:16:11 -0500
Subject: [Linux-cluster] new cluster acting odd
Message-ID: <CACMA5-yx0Bu1_i97-Ws4bW8hGFcTpR+-nEwnzXc4YN7GxxmvCA@mail.gmail.com>

Good Day,

I'm fairly new to the cluster world so i apologize in advance for
silly questions.  Thank you for any help.

We decided to use this cluster solution in order to share GFS2 mounts
across servers.  We have a 7 node cluster that is newly setup, but
acting oddly.  It has 3 vmware guest hosts and 4 physical hosts (dells
with Idracs).  They are all running Centos 6.6.  I have fencing
working (I'm able to do fence_node node and it will fence with
success).  I do not have the gfs2 mounts in the cluster yet.

When I don't touch the servers, my cluster looks perfect with all
nodes online. But when I start testing fencing, I have an odd problem
where i end up with split brain between some of the nodes.  They won't
seem to automatically fence each other when it gets like this.

in the  corosync.log for the node that gets split out i see the totem
chatter, but it seems confused and just keeps doing the below over and
over:


Dec 01 12:39:15 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a 2b 2c

Dec 01 12:39:17 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a 2b 2c

Dec 01 12:39:19 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a 2b 2c

Dec 01 12:39:39 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b

Dec 01 12:39:39 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
21 23 24 25 26 27 28 29 2a 2b 32
..
..
..
Dec 01 12:54:49 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c

Dec 01 12:54:50 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c

Dec 01 12:54:50 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c


I can manually fence it, and it still comes online with the same
issue.  I end up having to take the whole cluster down, sometimes
forcing reboot on some nodes, then brining it back up.  Its takes a
good part of the day just to bring the whole cluster online again.

I used ccs -h node --sync --activate and double checked to make sure
they are all using the same version of the cluster.conf file.

Once issue I did notice, is that when one of the vmware hosts is
rebooted, the time comes off slitty skewed (6 seconds) but i thought i
read somewhere that a skew that minor shouldn't impact the cluster.

We have multicast enabled on the interfaces

          UP BROADCAST RUNNING MASTER MULTICAST  MTU:9000  Metric:1
and we have been told by our network team that IGMP snooping is disabled.

With tcpdump I can see the multi-cast traffic chatter.

Right now:

[root at data1-uat ~]# clustat
Cluster Status for projectuat @ Mon Dec  1 13:56:39 2014
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 archive1-uat.domain.com                                1 Online
 admin1-uat.domain.com                                  2 Online
 mgmt1-uat.domain.com                                   3 Online
 map1-uat.domain.com                                    4 Online
 map2-uat.domain.com                                    5 Online
 cache1-uat.domain.com                                  6 Online
 data1-uat.domain.com                                   8 Online, Local


** Has itself ass online **
[root at map1-uat ~]# clustat
Cluster Status for projectuat @ Mon Dec  1 13:57:07 2014
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 archive1-uat.domain.com                                1 Online
 admin1-uat.domain.com                                  2 Online
 mgmt1-uat.domain.com                                   3 Online
 map1-uat.domain.com                                    4 Offline, Local
 map2-uat.domain.com                                    5 Online
 cache1-uat.domain.com                                  6 Online
 data1-uat.domain.com                                   8 Online


[root at cache1-uat ~]# clustat
Cluster Status for projectuat @ Mon Dec  1 13:57:39 2014
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 archive1-uat.domain.com                                1 Online
 admin1-uat.domain.com                                  2 Online
 mgmt1-uat.domain.com                                   3 Online
 map1-uat.domain.com                                    4 Online
 map2-uat.domain.com                                    5 Online
 cache1-uat.domain.com                                  6 Offline, Local
 data1-uat.domain.com                                   8 Online


[root at mgmt1-uat ~]# clustat
Cluster Status for projectuat @ Mon Dec  1 13:58:04 2014
Member Status: Inquorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 archive1-uat.domain.com                                1 Offline
 admin1-uat.domain.com                                  2 Offline
 mgmt1-uat.domain.com                                   3 Online, Local
 map1-uat.domain.com                                    4 Offline
 map2-uat.domain.com                                    5 Offline
 cache1-uat.domain.com                                  6 Offline
 data1-uat.domain.com                                   8 Offline


cman-3.0.12.1-68.el6.x86_64


[root at data1-uat ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="66" name="projectuat">
<clusternodes>
<clusternode name="admin1-uat.domain.com" nodeid="2">
<fence>
<method name="fenceadmin1uat">
<device name="vcappliancesoap" port="admin1-uat" ssl="on"
uuid="421df3c4-a686-9222-366e-9a67b25f62b2"/>
</method>
</fence>
</clusternode>
<clusternode name="mgmt1-uat.domain.com" nodeid="3">
<fence>
<method name="fenceadmin1uat">
<device name="vcappliancesoap" port="mgmt1-uat" ssl="on"
uuid="421d5ff5-66fa-5703-66d3-97f845cf8239"/>
</method>
</fence>
</clusternode>
<clusternode name="map1-uat.domain.com" nodeid="4">
<fence>
<method name="fencemap1uat">
<device name="idracmap1uat"/>
</method>
</fence>
</clusternode>
<clusternode name="map2-uat.domain.com" nodeid="5">
<fence>
<method name="fencemap2uat">
<device name="idracmap2uat"/>
</method>
</fence>
</clusternode>
<clusternode name="cache1-uat.domain.com" nodeid="6">
<fence>
<method name="fencecache1uat">
<device name="idraccache1uat"/>
</method>
</fence>
</clusternode>
<clusternode name="data1-uat.domain.com" nodeid="8">
<fence>
<method name="fencedata1uat">
<device name="idracdata1uat"/>
</method>
</fence>
</clusternode>
<clusternode name="archive1-uat.domain.com" nodeid="1">
<fence>
<method name="fenceadmin1uat">
<device name="vcappliancesoap" port="archive1-uat" ssl="on"
uuid="421d16b2-3ed0-0b9b-d530-0b151d81d24e"/>
</method>
</fence>
</clusternode>
</clusternodes>
<fencedevices>
<fencedevice agent="fence_vmware_soap" ipaddr="x.x.x.130"
login="fenceuat" login_timeout="10" name="vcappliancesoap"
passwd_script="/etc/cluster/forfencing.sh" power_timeout="10"
power_wait="30" retry_on="3" shell_timeout="10" ssl="1"/>
<fencedevice agent="fence_drac5" cmd_prompt="admin1-&gt;"
ipaddr="x.x.x.47" login="fenceuat" name="idracdata1uat"
passwd_script="/etc/cluster/forfencing.sh" power_timeout="60"
power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
<fencedevice agent="fence_drac5" cmd_prompt="admin1-&gt;"
ipaddr="x.x.x.48" login="fenceuat" name="idracdata2uat"
passwd_script="/etc/cluster/forfencing.sh" power_timeout="60"
power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
<fencedevice agent="fence_drac5" cmd_prompt="admin1-&gt;"
ipaddr="x.x.x.82" login="fenceuat" name="idracmap1uat"
passwd_script="/etc/cluster/forfencing.sh" power_timeout="60"
power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
<fencedevice agent="fence_drac5" cmd_prompt="admin1-&gt;"
ipaddr="x.x.x.96" login="fenceuat" name="idracmap2uat"
passwd_script="/etc/cluster/forfencing.sh" power_timeout="60"
power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
<fencedevice agent="fence_drac5" cmd_prompt="admin1-&gt;"
ipaddr="x.x.x.83" login="fenceuat" name="idraccache1uat"
passwd_script="/etc/cluster/forfencing.sh" power_timeout="60"
power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
<fencedevice agent="fence_drac5" cmd_prompt="admin1-&gt;"
ipaddr="x.x.x.97" login="fenceuat" name="idraccache2uat"
passwd_script="/etc/cluster/forfencing.sh" power_timeout="60"
power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
</fencedevices>
</cluster>


From lists at alteeve.ca  Mon Dec  1 15:57:19 2014
From: lists at alteeve.ca (Digimer)
Date: Mon, 01 Dec 2014 10:57:19 -0500
Subject: [Linux-cluster] new cluster acting odd
In-Reply-To: <CACMA5-yx0Bu1_i97-Ws4bW8hGFcTpR+-nEwnzXc4YN7GxxmvCA@mail.gmail.com>
References: <CACMA5-yx0Bu1_i97-Ws4bW8hGFcTpR+-nEwnzXc4YN7GxxmvCA@mail.gmail.com>
Message-ID: <547C8FDF.1070000@alteeve.ca>

On 01/12/14 09:16 AM, Megan . wrote:
> Good Day,
>
> I'm fairly new to the cluster world so i apologize in advance for
> silly questions.  Thank you for any help.

No pre-existing knowledge required, no need to apologize. :)

> We decided to use this cluster solution in order to share GFS2 mounts
> across servers.  We have a 7 node cluster that is newly setup, but
> acting oddly.  It has 3 vmware guest hosts and 4 physical hosts (dells
> with Idracs).  They are all running Centos 6.6.  I have fencing
> working (I'm able to do fence_node node and it will fence with
> success).  I do not have the gfs2 mounts in the cluster yet.

Very glad you have fencing, that's a common early mistake.

7-node cluster is actually pretty large and is around the upper-end 
before tuning starts to become fairly important.

> When I don't touch the servers, my cluster looks perfect with all
> nodes online. But when I start testing fencing, I have an odd problem
> where i end up with split brain between some of the nodes.  They won't
> seem to automatically fence each other when it gets like this.

If you get a split-brain, something is seriously broken. Either the 
fencing isn't properly working (getting a false success from the agent, 
for example). Can you pastebin your cluster.conf (or fpaste or something 
where tabs are preserved to make it more readible)?

> in the  corosync.log for the node that gets split out i see the totem
> chatter, but it seems confused and just keeps doing the below over and
> over:
>
> Dec 01 12:39:15 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a 2b 2c
>
> Dec 01 12:39:17 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a 2b 2c
>
> Dec 01 12:39:19 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a 2b 2c
>
> Dec 01 12:39:39 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
>
> Dec 01 12:39:39 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
> 21 23 24 25 26 27 28 29 2a 2b 32
> ..
> ..
> ..
> Dec 01 12:54:49 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
> 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c
>
> Dec 01 12:54:50 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
> 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c
>
> Dec 01 12:54:50 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
> 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c

This is a sign of network congestion. This is the node saying "I lost 
some (corosync) data, please retransmit".

> I can manually fence it, and it still comes online with the same
> issue.  I end up having to take the whole cluster down, sometimes
> forcing reboot on some nodes, then brining it back up.  Its takes a
> good part of the day just to bring the whole cluster online again.

Something fence related is not working.

> I used ccs -h node --sync --activate and double checked to make sure
> they are all using the same version of the cluster.conf file.

You can also use 'cman_tool version'.

> Once issue I did notice, is that when one of the vmware hosts is
> rebooted, the time comes off slitty skewed (6 seconds) but i thought i
> read somewhere that a skew that minor shouldn't impact the cluster.

Iirc, before RHEL 6.2, this was a problem. Now though, it shouldn't be. 
I am more curious what might be underlying the skew, rather than the 
skew itself being a concern.

> We have multicast enabled on the interfaces
>
>            UP BROADCAST RUNNING MASTER MULTICAST  MTU:9000  Metric:1
> and we have been told by our network team that IGMP snooping is disabled.
>
> With tcpdump I can see the multi-cast traffic chatter.
>
> Right now:
>
> [root at data1-uat ~]# clustat
> Cluster Status for projectuat @ Mon Dec  1 13:56:39 2014
> Member Status: Quorate
>
>   Member Name                                                     ID   Status
>   ------ ----                                                     ---- ------
>   archive1-uat.domain.com                                1 Online
>   admin1-uat.domain.com                                  2 Online
>   mgmt1-uat.domain.com                                   3 Online
>   map1-uat.domain.com                                    4 Online
>   map2-uat.domain.com                                    5 Online
>   cache1-uat.domain.com                                  6 Online
>   data1-uat.domain.com                                   8 Online, Local
>
>
>
> ** Has itself ass online **
> [root at map1-uat ~]# clustat
> Cluster Status for projectuat @ Mon Dec  1 13:57:07 2014
> Member Status: Quorate
>
>   Member Name                                                     ID   Status
>   ------ ----                                                     ---- ------
>   archive1-uat.domain.com                                1 Online
>   admin1-uat.domain.com                                  2 Online
>   mgmt1-uat.domain.com                                   3 Online
>   map1-uat.domain.com                                    4 Offline, Local
>   map2-uat.domain.com                                    5 Online
>   cache1-uat.domain.com                                  6 Online
>   data1-uat.domain.com                                   8 Online

That is really, really odd. I think we'll need one of the red hat folks 
to chime in.

> [root at cache1-uat ~]# clustat
> Cluster Status for projectuat @ Mon Dec  1 13:57:39 2014
> Member Status: Quorate
>
>   Member Name                                                     ID   Status
>   ------ ----                                                     ---- ------
>   archive1-uat.domain.com                                1 Online
>   admin1-uat.domain.com                                  2 Online
>   mgmt1-uat.domain.com                                   3 Online
>   map1-uat.domain.com                                    4 Online
>   map2-uat.domain.com                                    5 Online
>   cache1-uat.domain.com                                  6 Offline, Local
>   data1-uat.domain.com                                   8 Online
>
>
>
> [root at mgmt1-uat ~]# clustat
> Cluster Status for projectuat @ Mon Dec  1 13:58:04 2014
> Member Status: Inquorate
>
>   Member Name                                                     ID   Status
>   ------ ----                                                     ---- ------
>   archive1-uat.domain.com                                1 Offline
>   admin1-uat.domain.com                                  2 Offline
>   mgmt1-uat.domain.com                                   3 Online, Local
>   map1-uat.domain.com                                    4 Offline
>   map2-uat.domain.com                                    5 Offline
>   cache1-uat.domain.com                                  6 Offline
>   data1-uat.domain.com                                   8 Offline
>
>
> cman-3.0.12.1-68.el6.x86_64
>
>
> [root at data1-uat ~]# cat /etc/cluster/cluster.conf
> <?xml version="1.0"?>
> <cluster config_version="66" name="projectuat">
> <clusternodes>
> <clusternode name="admin1-uat.domain.com" nodeid="2">
> <fence>
> <method name="fenceadmin1uat">
> <device name="vcappliancesoap" port="admin1-uat" ssl="on"
> uuid="421df3c4-a686-9222-366e-9a67b25f62b2"/>
> </method>
> </fence>
> </clusternode>
> <clusternode name="mgmt1-uat.domain.com" nodeid="3">
> <fence>
> <method name="fenceadmin1uat">
> <device name="vcappliancesoap" port="mgmt1-uat" ssl="on"
> uuid="421d5ff5-66fa-5703-66d3-97f845cf8239"/>
> </method>
> </fence>
> </clusternode>
> <clusternode name="map1-uat.domain.com" nodeid="4">
> <fence>
> <method name="fencemap1uat">
> <device name="idracmap1uat"/>
> </method>
> </fence>
> </clusternode>
> <clusternode name="map2-uat.domain.com" nodeid="5">
> <fence>
> <method name="fencemap2uat">
> <device name="idracmap2uat"/>
> </method>
> </fence>
> </clusternode>
> <clusternode name="cache1-uat.domain.com" nodeid="6">
> <fence>
> <method name="fencecache1uat">
> <device name="idraccache1uat"/>
> </method>
> </fence>
> </clusternode>
> <clusternode name="data1-uat.domain.com" nodeid="8">
> <fence>
> <method name="fencedata1uat">
> <device name="idracdata1uat"/>
> </method>
> </fence>
> </clusternode>
> <clusternode name="archive1-uat.domain.com" nodeid="1">
> <fence>
> <method name="fenceadmin1uat">
> <device name="vcappliancesoap" port="archive1-uat" ssl="on"
> uuid="421d16b2-3ed0-0b9b-d530-0b151d81d24e"/>
> </method>
> </fence>
> </clusternode>
> </clusternodes>
> <fencedevices>
> <fencedevice agent="fence_vmware_soap" ipaddr="x.x.x.130"
> login="fenceuat" login_timeout="10" name="vcappliancesoap"
> passwd_script="/etc/cluster/forfencing.sh" power_timeout="10"
> power_wait="30" retry_on="3" shell_timeout="10" ssl="1"/>
> <fencedevice agent="fence_drac5" cmd_prompt="admin1-&gt;"
> ipaddr="x.x.x.47" login="fenceuat" name="idracdata1uat"
> passwd_script="/etc/cluster/forfencing.sh" power_timeout="60"
> power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
> <fencedevice agent="fence_drac5" cmd_prompt="admin1-&gt;"
> ipaddr="x.x.x.48" login="fenceuat" name="idracdata2uat"
> passwd_script="/etc/cluster/forfencing.sh" power_timeout="60"
> power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
> <fencedevice agent="fence_drac5" cmd_prompt="admin1-&gt;"
> ipaddr="x.x.x.82" login="fenceuat" name="idracmap1uat"
> passwd_script="/etc/cluster/forfencing.sh" power_timeout="60"
> power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
> <fencedevice agent="fence_drac5" cmd_prompt="admin1-&gt;"
> ipaddr="x.x.x.96" login="fenceuat" name="idracmap2uat"
> passwd_script="/etc/cluster/forfencing.sh" power_timeout="60"
> power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
> <fencedevice agent="fence_drac5" cmd_prompt="admin1-&gt;"
> ipaddr="x.x.x.83" login="fenceuat" name="idraccache1uat"
> passwd_script="/etc/cluster/forfencing.sh" power_timeout="60"
> power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
> <fencedevice agent="fence_drac5" cmd_prompt="admin1-&gt;"
> ipaddr="x.x.x.97" login="fenceuat" name="idraccache2uat"
> passwd_script="/etc/cluster/forfencing.sh" power_timeout="60"
> power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
> </fencedevices>
> </cluster>

-ENOPARSE

My recommendation would be to schedule a maintenance window and then 
stop everything except cman (no rgmanager, no gfs2, etc). Then 
methodically test crashing all nodes (I like 'echo c > 
/proc/sysrq-trigger) and verify they are fenced and then recover 
properly. It's worth disabling cman and rgmanager from starting at boot 
(period, but particularly for this test).

If you can reliably (and repeatedly) crash -> fence -> rejoin, then I'd 
start loading back services and re-trying. If the problem reappears only 
under load, then that's an indication of the problem, too.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


From dan131riley at gmail.com  Mon Dec  1 16:34:32 2014
From: dan131riley at gmail.com (Dan Riley)
Date: Mon, 1 Dec 2014 11:34:32 -0500
Subject: [Linux-cluster] new cluster acting odd
In-Reply-To: <547C8FDF.1070000@alteeve.ca>
References: <CACMA5-yx0Bu1_i97-Ws4bW8hGFcTpR+-nEwnzXc4YN7GxxmvCA@mail.gmail.com>
	<547C8FDF.1070000@alteeve.ca>
Message-ID: <8BA76491-3ED8-4F4F-A38E-7F118767270A@gmail.com>


> On Dec 1, 2014, at 10:57, Digimer <lists at alteeve.ca> wrote:
> 
> On 01/12/14 09:16 AM, Megan . wrote:
>> We decided to use this cluster solution in order to share GFS2 mounts
>> across servers.  We have a 7 node cluster that is newly setup, but
>> acting oddly.  It has 3 vmware guest hosts and 4 physical hosts (dells
>> with Idracs).  They are all running Centos 6.6.  I have fencing
>> working (I'm able to do fence_node node and it will fence with
>> success).  I do not have the gfs2 mounts in the cluster yet.
> 
> Very glad you have fencing, that's a common early mistake.
> 
> 7-node cluster is actually pretty large and is around the upper-end before tuning starts to become fairly important.

Ha, I was unaware this was part of the folklore.  We have a couple of 9-node clusters, it did take some tuning to get them stable, and we are thinking about splitting one of them.  For our clusters, we found uniform configuration helped a lot, so a mix of physical hosts and VMs in the same (largish) cluster would make me a little nervous, don't know about anyone else's feelings.

>> I can manually fence it, and it still comes online with the same
>> issue.  I end up having to take the whole cluster down, sometimes
>> forcing reboot on some nodes, then brining it back up.  Its takes a
>> good part of the day just to bring the whole cluster online again.
> 
> Something fence related is not working.

We used to see something like this, and it usually traced back to "shouldn't be possible" inconsistencies in the fence group membership.  Once the fence group gets blocked by a mismatched membership list, everything above it breaks.  Sometimes a "fence_tool ls -n" on all the cluster members will reveal an inconsistency ("fence_tool dump" also, but that's hard to interpret without digging into the group membership protocols).  If you can find an inconsistency, manually fencing the nodes in the minority might repair it.

At the time, I did quite a lot of staring at fence_tool dumps, but never figured out how the fence group was getting into "shouldn't be possible" inconsistencies.  This was also all late RHEL5 and early RHEL6, so may not be applicable anymore.

> My recommendation would be to schedule a maintenance window and then stop everything except cman (no rgmanager, no gfs2, etc). Then methodically test crashing all nodes (I like 'echo c > /proc/sysrq-trigger) and verify they are fenced and then recover properly. It's worth disabling cman and rgmanager from starting at boot (period, but particularly for this test).
> 
> If you can reliably (and repeatedly) crash -> fence -> rejoin, then I'd start loading back services and re-trying. If the problem reappears only under load, then that's an indication of the problem, too.

I'd agree--start at the bottom of the stack and work your way up.

-dan


From nagemnna at gmail.com  Mon Dec  1 16:56:51 2014
From: nagemnna at gmail.com (Megan .)
Date: Mon, 1 Dec 2014 11:56:51 -0500
Subject: [Linux-cluster] new cluster acting odd
In-Reply-To: <547C8FDF.1070000@alteeve.ca>
References: <CACMA5-yx0Bu1_i97-Ws4bW8hGFcTpR+-nEwnzXc4YN7GxxmvCA@mail.gmail.com>
	<547C8FDF.1070000@alteeve.ca>
Message-ID: <CACMA5-xzK4nF7JkfchKY732bzC+nZ3EJ9ga5Yp0NHxn5toEmqg@mail.gmail.com>

Thank you for your replies.

The cluster is intended to be 9 nodes, but i haven't finished building
the remaining 2.  Our production cluster is expected to be similar in
size.  What tuning should I be looking at?


Here is a link to our config.  http://pastebin.com/LUHM8GQR  I had to
remove IP addresses.


I tried the method of (echo c > /proc/sysrq-trigger) to crash a node,
the cluster kept seeing it as online and never fenced it, yet i could
no longer ssh to the node.  I did this on a physical and VM box with
the same result.  I had to fence_node node to get it to reboot, but it
came up split brained (thinking it was the only one online). Now that
node has cman down and the rest of the cluster sees it as still
online.

I thought fencing was working because i'm able to do fence_node node
and see the box reboot and come back online.  I did have to get the FC
version of the fence_agents because of an issue with the idrac agent
not working properly.  We are running fence-agents-3.1.6-1.fc14.x86_64


fence_tool dump worked on one of my nodes, but it is just hanging on the rest.

[root at map1-uat ~]# fence_tool dump
1417448610 logging mode 3 syslog f 160 p 6 logfile p 6
/var/log/cluster/fenced.log
1417448610 fenced 3.0.12.1 started
1417448610 connected to dbus :1.12
1417448610 cluster node 1 added seq 89048
1417448610 cluster node 2 added seq 89048
1417448610 cluster node 3 added seq 89048
1417448610 cluster node 4 added seq 89048
1417448610 cluster node 5 added seq 89048
1417448610 cluster node 6 added seq 89048
1417448610 cluster node 8 added seq 89048
1417448610 our_nodeid 4 our_name map1-uat.project.domain.com
1417448611 logging mode 3 syslog f 160 p 6 logfile p 6
/var/log/cluster/fenced.log
1417448611 logfile cur mode 100644
1417448611 cpg_join fenced:daemon ...
1417448621 daemon cpg_join error retrying
1417448631 daemon cpg_join error retrying
1417448641 daemon cpg_join error retrying
1417448651 daemon cpg_join error retrying
1417448661 daemon cpg_join error retrying
1417448671 daemon cpg_join error retrying
1417448681 daemon cpg_join error retrying
1417448691 daemon cpg_join error retrying
.
.
.


[root at map1-uat ~]# clustat
Cluster Status for gibsuat @ Mon Dec  1 16:51:49 2014
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 archive1-uat.project.domain.com                                1 Online
 admin1-uat.project.domain.com                                  2 Online
 mgmt1-uat.project.domain.com                                   3 Online
 map1-uat.project.domain.com                                    4 Online, Local
 map2-uat.project.domain.com                                    5 Online
 cache1-uat.project.domain.com                                 6 Online
 data1-uat.project.domain.com                                   8 Online


The  /var/log/cluster/fenced.log on the nodes is saying Dec 01
16:02:34 fenced cpg_join error retrying every 10th of a second.

Obviously having some major issues.  These are fresh boxes, no other
services right now other then ones related to the cluster.

I've also experimented with the  <cman transport="udpu"/> to disable
multicast to see if that helped but it doesn't seem to make a
difference with the node stability.

Is there a document or some sort of reference that I can give the
network folks on how the switches should be configured?  I read stuff
on boards about IGMP snooping, but I couldn't find anything from
RedHat to hand them.

On Mon, Dec 1, 2014 at 10:57 AM, Digimer <lists at alteeve.ca> wrote:
> On 01/12/14 09:16 AM, Megan . wrote:
>>
>> Good Day,
>>
>> I'm fairly new to the cluster world so i apologize in advance for
>> silly questions.  Thank you for any help.
>
>
> No pre-existing knowledge required, no need to apologize. :)
>
>> We decided to use this cluster solution in order to share GFS2 mounts
>> across servers.  We have a 7 node cluster that is newly setup, but
>> acting oddly.  It has 3 vmware guest hosts and 4 physical hosts (dells
>> with Idracs).  They are all running Centos 6.6.  I have fencing
>> working (I'm able to do fence_node node and it will fence with
>> success).  I do not have the gfs2 mounts in the cluster yet.
>
>
> Very glad you have fencing, that's a common early mistake.
>
> 7-node cluster is actually pretty large and is around the upper-end before
> tuning starts to become fairly important.
>
>> When I don't touch the servers, my cluster looks perfect with all
>> nodes online. But when I start testing fencing, I have an odd problem
>> where i end up with split brain between some of the nodes.  They won't
>> seem to automatically fence each other when it gets like this.
>
>
> If you get a split-brain, something is seriously broken. Either the fencing
> isn't properly working (getting a false success from the agent, for
> example). Can you pastebin your cluster.conf (or fpaste or something where
> tabs are preserved to make it more readible)?
>
>> in the  corosync.log for the node that gets split out i see the totem
>> chatter, but it seems confused and just keeps doing the below over and
>> over:
>>
>> Dec 01 12:39:15 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a
>> 2b 2c
>>
>> Dec 01 12:39:17 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a
>> 2b 2c
>>
>> Dec 01 12:39:19 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a
>> 2b 2c
>>
>> Dec 01 12:39:39 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
>>
>> Dec 01 12:39:39 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
>> 21 23 24 25 26 27 28 29 2a 2b 32
>> ..
>> ..
>> ..
>> Dec 01 12:54:49 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
>> 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c
>>
>> Dec 01 12:54:50 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
>> 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c
>>
>> Dec 01 12:54:50 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
>> 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c
>
>
> This is a sign of network congestion. This is the node saying "I lost some
> (corosync) data, please retransmit".
>
>> I can manually fence it, and it still comes online with the same
>> issue.  I end up having to take the whole cluster down, sometimes
>> forcing reboot on some nodes, then brining it back up.  Its takes a
>> good part of the day just to bring the whole cluster online again.
>
>
> Something fence related is not working.
>
>> I used ccs -h node --sync --activate and double checked to make sure
>> they are all using the same version of the cluster.conf file.
>
>
> You can also use 'cman_tool version'.
>
>> Once issue I did notice, is that when one of the vmware hosts is
>> rebooted, the time comes off slitty skewed (6 seconds) but i thought i
>> read somewhere that a skew that minor shouldn't impact the cluster.
>
>
> Iirc, before RHEL 6.2, this was a problem. Now though, it shouldn't be. I am
> more curious what might be underlying the skew, rather than the skew itself
> being a concern.
>
>
>> We have multicast enabled on the interfaces
>>
>>            UP BROADCAST RUNNING MASTER MULTICAST  MTU:9000  Metric:1
>> and we have been told by our network team that IGMP snooping is disabled.
>>
>> With tcpdump I can see the multi-cast traffic chatter.
>>
>> Right now:
>>
>> [root at data1-uat ~]# clustat
>> Cluster Status for projectuat @ Mon Dec  1 13:56:39 2014
>> Member Status: Quorate
>>
>>   Member Name                                                     ID
>> Status
>>   ------ ----                                                     ----
>> ------
>>   archive1-uat.domain.com                                1 Online
>>   admin1-uat.domain.com                                  2 Online
>>   mgmt1-uat.domain.com                                   3 Online
>>   map1-uat.domain.com                                    4 Online
>>   map2-uat.domain.com                                    5 Online
>>   cache1-uat.domain.com                                  6 Online
>>   data1-uat.domain.com                                   8 Online, Local
>>
>>
>>
>> ** Has itself ass online **
>> [root at map1-uat ~]# clustat
>> Cluster Status for projectuat @ Mon Dec  1 13:57:07 2014
>> Member Status: Quorate
>>
>>   Member Name                                                     ID
>> Status
>>   ------ ----                                                     ----
>> ------
>>   archive1-uat.domain.com                                1 Online
>>   admin1-uat.domain.com                                  2 Online
>>   mgmt1-uat.domain.com                                   3 Online
>>   map1-uat.domain.com                                    4 Offline, Local
>>   map2-uat.domain.com                                    5 Online
>>   cache1-uat.domain.com                                  6 Online
>>   data1-uat.domain.com                                   8 Online
>
>
> That is really, really odd. I think we'll need one of the red hat folks to
> chime in.
>
>
>> [root at cache1-uat ~]# clustat
>> Cluster Status for projectuat @ Mon Dec  1 13:57:39 2014
>> Member Status: Quorate
>>
>>   Member Name                                                     ID
>> Status
>>   ------ ----                                                     ----
>> ------
>>   archive1-uat.domain.com                                1 Online
>>   admin1-uat.domain.com                                  2 Online
>>   mgmt1-uat.domain.com                                   3 Online
>>   map1-uat.domain.com                                    4 Online
>>   map2-uat.domain.com                                    5 Online
>>   cache1-uat.domain.com                                  6 Offline, Local
>>   data1-uat.domain.com                                   8 Online
>>
>>
>>
>> [root at mgmt1-uat ~]# clustat
>> Cluster Status for projectuat @ Mon Dec  1 13:58:04 2014
>> Member Status: Inquorate
>>
>>   Member Name                                                     ID
>> Status
>>   ------ ----                                                     ----
>> ------
>>   archive1-uat.domain.com                                1 Offline
>>   admin1-uat.domain.com                                  2 Offline
>>   mgmt1-uat.domain.com                                   3 Online, Local
>>   map1-uat.domain.com                                    4 Offline
>>   map2-uat.domain.com                                    5 Offline
>>   cache1-uat.domain.com                                  6 Offline
>>   data1-uat.domain.com                                   8 Offline
>>
>>
>> cman-3.0.12.1-68.el6.x86_64
>>
>>
>> [root at data1-uat ~]# cat /etc/cluster/cluster.conf
>> <?xml version="1.0"?>
>> <cluster config_version="66" name="projectuat">
>> <clusternodes>
>> <clusternode name="admin1-uat.domain.com" nodeid="2">
>> <fence>
>> <method name="fenceadmin1uat">
>> <device name="vcappliancesoap" port="admin1-uat" ssl="on"
>> uuid="421df3c4-a686-9222-366e-9a67b25f62b2"/>
>> </method>
>> </fence>
>> </clusternode>
>> <clusternode name="mgmt1-uat.domain.com" nodeid="3">
>> <fence>
>> <method name="fenceadmin1uat">
>> <device name="vcappliancesoap" port="mgmt1-uat" ssl="on"
>> uuid="421d5ff5-66fa-5703-66d3-97f845cf8239"/>
>> </method>
>> </fence>
>> </clusternode>
>> <clusternode name="map1-uat.domain.com" nodeid="4">
>> <fence>
>> <method name="fencemap1uat">
>> <device name="idracmap1uat"/>
>> </method>
>> </fence>
>> </clusternode>
>> <clusternode name="map2-uat.domain.com" nodeid="5">
>> <fence>
>> <method name="fencemap2uat">
>> <device name="idracmap2uat"/>
>> </method>
>> </fence>
>> </clusternode>
>> <clusternode name="cache1-uat.domain.com" nodeid="6">
>> <fence>
>> <method name="fencecache1uat">
>> <device name="idraccache1uat"/>
>> </method>
>> </fence>
>> </clusternode>
>> <clusternode name="data1-uat.domain.com" nodeid="8">
>> <fence>
>> <method name="fencedata1uat">
>> <device name="idracdata1uat"/>
>> </method>
>> </fence>
>> </clusternode>
>> <clusternode name="archive1-uat.domain.com" nodeid="1">
>> <fence>
>> <method name="fenceadmin1uat">
>> <device name="vcappliancesoap" port="archive1-uat" ssl="on"
>> uuid="421d16b2-3ed0-0b9b-d530-0b151d81d24e"/>
>> </method>
>> </fence>
>> </clusternode>
>> </clusternodes>
>> <fencedevices>
>> <fencedevice agent="fence_vmware_soap" ipaddr="x.x.x.130"
>> login="fenceuat" login_timeout="10" name="vcappliancesoap"
>> passwd_script="/etc/cluster/forfencing.sh" power_timeout="10"
>> power_wait="30" retry_on="3" shell_timeout="10" ssl="1"/>
>> <fencedevice agent="fence_drac5" cmd_prompt="admin1-&gt;"
>> ipaddr="x.x.x.47" login="fenceuat" name="idracdata1uat"
>> passwd_script="/etc/cluster/forfencing.sh" power_timeout="60"
>> power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
>> <fencedevice agent="fence_drac5" cmd_prompt="admin1-&gt;"
>> ipaddr="x.x.x.48" login="fenceuat" name="idracdata2uat"
>> passwd_script="/etc/cluster/forfencing.sh" power_timeout="60"
>> power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
>> <fencedevice agent="fence_drac5" cmd_prompt="admin1-&gt;"
>> ipaddr="x.x.x.82" login="fenceuat" name="idracmap1uat"
>> passwd_script="/etc/cluster/forfencing.sh" power_timeout="60"
>> power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
>> <fencedevice agent="fence_drac5" cmd_prompt="admin1-&gt;"
>> ipaddr="x.x.x.96" login="fenceuat" name="idracmap2uat"
>> passwd_script="/etc/cluster/forfencing.sh" power_timeout="60"
>> power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
>> <fencedevice agent="fence_drac5" cmd_prompt="admin1-&gt;"
>> ipaddr="x.x.x.83" login="fenceuat" name="idraccache1uat"
>> passwd_script="/etc/cluster/forfencing.sh" power_timeout="60"
>> power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
>> <fencedevice agent="fence_drac5" cmd_prompt="admin1-&gt;"
>> ipaddr="x.x.x.97" login="fenceuat" name="idraccache2uat"
>> passwd_script="/etc/cluster/forfencing.sh" power_timeout="60"
>> power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
>> </fencedevices>
>> </cluster>
>
>
> -ENOPARSE
>
> My recommendation would be to schedule a maintenance window and then stop
> everything except cman (no rgmanager, no gfs2, etc). Then methodically test
> crashing all nodes (I like 'echo c > /proc/sysrq-trigger) and verify they
> are fenced and then recover properly. It's worth disabling cman and
> rgmanager from starting at boot (period, but particularly for this test).
>
> If you can reliably (and repeatedly) crash -> fence -> rejoin, then I'd
> start loading back services and re-trying. If the problem reappears only
> under load, then that's an indication of the problem, too.
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From lists at alteeve.ca  Mon Dec  1 17:02:43 2014
From: lists at alteeve.ca (Digimer)
Date: Mon, 01 Dec 2014 12:02:43 -0500
Subject: [Linux-cluster] new cluster acting odd
In-Reply-To: <8BA76491-3ED8-4F4F-A38E-7F118767270A@gmail.com>
References: <CACMA5-yx0Bu1_i97-Ws4bW8hGFcTpR+-nEwnzXc4YN7GxxmvCA@mail.gmail.com>	<547C8FDF.1070000@alteeve.ca>
	<8BA76491-3ED8-4F4F-A38E-7F118767270A@gmail.com>
Message-ID: <547C9F33.7030109@alteeve.ca>

On 01/12/14 11:34 AM, Dan Riley wrote:
> Ha, I was unaware this was part of the folklore.  We have a couple of 9-node clusters, it did take some tuning to get them stable, and we are thinking about splitting one of them.  For our clusters, we found uniform configuration helped a lot, so a mix of physical hosts and VMs in the same (largish) cluster would make me a little nervous, don't know about anyone else's feelings.

Personally, I only build 2-node clusters. When I need more resource, I 
drop-in another pair. This allows all my clusters, going back to 2008/9, 
to have nearly identical configurations. In HA, I would argue, nothing 
is more useful than consistency and simplicity.

That said, I'd not fault anyone for going to 4 or 5 node. Beyond that 
though, I would argue that the cluster should be broken up. In HA, 
uptime should always trump resource utilization efficiency.

Mixing real and virtual machines strikes me as an avoidable complexity, too.

>> Something fence related is not working.
>
> We used to see something like this, and it usually traced back to "shouldn't be possible" inconsistencies in the fence group membership.  Once the fence group gets blocked by a mismatched membership list, everything above it breaks.  Sometimes a "fence_tool ls -n" on all the cluster members will reveal an inconsistency ("fence_tool dump" also, but that's hard to interpret without digging into the group membership protocols).  If you can find an inconsistency, manually fencing the nodes in the minority might repair it.

In all my years, I've never seen this happen in production. If you can 
create a reproducer, I would *love* to see/examine it!

> At the time, I did quite a lot of staring at fence_tool dumps, but never figured out how the fence group was getting into "shouldn't be possible" inconsistencies.  This was also all late RHEL5 and early RHEL6, so may not be applicable anymore.

HA in 6.2+ seems to be quite a bit more stable (I think for more reasons 
than just the HA stuff). For this reason, I am staying on RHEL 6 until 
at least 7.2+ is out. :)

>> My recommendation would be to schedule a maintenance window and then stop everything except cman (no rgmanager, no gfs2, etc). Then methodically test crashing all nodes (I like 'echo c > /proc/sysrq-trigger) and verify they are fenced and then recover properly. It's worth disabling cman and rgmanager from starting at boot (period, but particularly for this test).
>>
>> If you can reliably (and repeatedly) crash -> fence -> rejoin, then I'd start loading back services and re-trying. If the problem reappears only under load, then that's an indication of the problem, too.
>
> I'd agree--start at the bottom of the stack and work your way up.
>
> -dan

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


From lists at alteeve.ca  Mon Dec  1 17:14:22 2014
From: lists at alteeve.ca (Digimer)
Date: Mon, 01 Dec 2014 12:14:22 -0500
Subject: [Linux-cluster] new cluster acting odd
In-Reply-To: <CACMA5-xzK4nF7JkfchKY732bzC+nZ3EJ9ga5Yp0NHxn5toEmqg@mail.gmail.com>
References: <CACMA5-yx0Bu1_i97-Ws4bW8hGFcTpR+-nEwnzXc4YN7GxxmvCA@mail.gmail.com>	<547C8FDF.1070000@alteeve.ca>
	<CACMA5-xzK4nF7JkfchKY732bzC+nZ3EJ9ga5Yp0NHxn5toEmqg@mail.gmail.com>
Message-ID: <547CA1EE.9030001@alteeve.ca>

On 01/12/14 11:56 AM, Megan . wrote:
> Thank you for your replies.
>
> The cluster is intended to be 9 nodes, but i haven't finished building
> the remaining 2.  Our production cluster is expected to be similar in
> size.  What tuning should I be looking at?
>
>
> Here is a link to our config.  http://pastebin.com/LUHM8GQR  I had to
> remove IP addresses.

Can you simplify those fencedevice definitions? I would wonder if the 
set timeouts could be part of the problem. Always start with the 
simplest possible configurations and only add options in response to 
actual issues discovered in testing.

> I tried the method of (echo c > /proc/sysrq-trigger) to crash a node,
> the cluster kept seeing it as online and never fenced it, yet i could
> no longer ssh to the node.  I did this on a physical and VM box with
> the same result.  I had to fence_node node to get it to reboot, but it
> came up split brained (thinking it was the only one online). Now that
> node has cman down and the rest of the cluster sees it as still
> online.

Then corosync failed to detect the fault. That is a sign, to me, of a 
fundamental network or configuration issue. Corosync should have shown 
messages about a node being lost and reconfiguring. If that didn't 
happen, then you're not even up to the point where fencing factors in.

Did you configure corosync.conf? When it came up, did it think it was 
quorate or inquorate?

> I thought fencing was working because i'm able to do fence_node node
> and see the box reboot and come back online.  I did have to get the FC
> version of the fence_agents because of an issue with the idrac agent
> not working properly.  We are running fence-agents-3.1.6-1.fc14.x86_64

That tells you that the configuration of the fence agents is working, 
but it doesn't test failure detection. You can use the 'fence_check' 
tool to see if the cluster can talk to everything, but in the end, the 
only useful test is to simulate an actual crash.

Wait; 'fc14' ?! What OS are you using?

> fence_tool dump worked on one of my nodes, but it is just hanging on the rest.
>
> [root at map1-uat ~]# fence_tool dump
> 1417448610 logging mode 3 syslog f 160 p 6 logfile p 6
> /var/log/cluster/fenced.log
> 1417448610 fenced 3.0.12.1 started
> 1417448610 connected to dbus :1.12
> 1417448610 cluster node 1 added seq 89048
> 1417448610 cluster node 2 added seq 89048
> 1417448610 cluster node 3 added seq 89048
> 1417448610 cluster node 4 added seq 89048
> 1417448610 cluster node 5 added seq 89048
> 1417448610 cluster node 6 added seq 89048
> 1417448610 cluster node 8 added seq 89048
> 1417448610 our_nodeid 4 our_name map1-uat.project.domain.com
> 1417448611 logging mode 3 syslog f 160 p 6 logfile p 6
> /var/log/cluster/fenced.log
> 1417448611 logfile cur mode 100644
> 1417448611 cpg_join fenced:daemon ...
> 1417448621 daemon cpg_join error retrying
> 1417448631 daemon cpg_join error retrying
> 1417448641 daemon cpg_join error retrying
> 1417448651 daemon cpg_join error retrying
> 1417448661 daemon cpg_join error retrying
> 1417448671 daemon cpg_join error retrying
> 1417448681 daemon cpg_join error retrying
> 1417448691 daemon cpg_join error retrying
> .
> .
> .
>
>
> [root at map1-uat ~]# clustat
> Cluster Status for gibsuat @ Mon Dec  1 16:51:49 2014
> Member Status: Quorate
>
>   Member Name                                                     ID   Status
>   ------ ----                                                     ---- ------
>   archive1-uat.project.domain.com                                1 Online
>   admin1-uat.project.domain.com                                  2 Online
>   mgmt1-uat.project.domain.com                                   3 Online
>   map1-uat.project.domain.com                                    4 Online, Local
>   map2-uat.project.domain.com                                    5 Online
>   cache1-uat.project.domain.com                                 6 Online
>   data1-uat.project.domain.com                                   8 Online
>
>
> The  /var/log/cluster/fenced.log on the nodes is saying Dec 01
> 16:02:34 fenced cpg_join error retrying every 10th of a second.
>
> Obviously having some major issues.  These are fresh boxes, no other
> services right now other then ones related to the cluster.

What OS/version?

> I've also experimented with the  <cman transport="udpu"/> to disable
> multicast to see if that helped but it doesn't seem to make a
> difference with the node stability.

Very bad idea with >2~3 node clusters. The overhead will be far too 
great for a 7~9 node cluster.

> Is there a document or some sort of reference that I can give the
> network folks on how the switches should be configured?  I read stuff
> on boards about IGMP snooping, but I couldn't find anything from
> RedHat to hand them.

I have this:

https://alteeve.ca/w/AN!Cluster_Tutorial_2#Six_Network_Interfaces.2C_Seriously.3F

https://alteeve.ca/w/AN!Cluster_Tutorial_2#Network_Switches

https://alteeve.ca/w/AN!Cluster_Tutorial_2#Network_Security_Considerations

https://alteeve.ca/w/AN!Cluster_Tutorial_2#Network

There are comments in there about multicast, etc.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


From nagemnna at gmail.com  Mon Dec  1 18:03:50 2014
From: nagemnna at gmail.com (Megan .)
Date: Mon, 1 Dec 2014 13:03:50 -0500
Subject: [Linux-cluster] new cluster acting odd
In-Reply-To: <547CA1EE.9030001@alteeve.ca>
References: <CACMA5-yx0Bu1_i97-Ws4bW8hGFcTpR+-nEwnzXc4YN7GxxmvCA@mail.gmail.com>
	<547C8FDF.1070000@alteeve.ca>
	<CACMA5-xzK4nF7JkfchKY732bzC+nZ3EJ9ga5Yp0NHxn5toEmqg@mail.gmail.com>
	<547CA1EE.9030001@alteeve.ca>
Message-ID: <CACMA5-xVGQQUHj6-bdRQMtiHXp1T+X7XaYmGWnQ2KEPVip+i0w@mail.gmail.com>

We have 11 10-20TB GFS2 mounts that I need to share across all nodes.
Its the only reason we went with the cluster solution.  I don't know
how we could split it up into different smaller clusters.

On Mon, Dec 1, 2014 at 12:14 PM, Digimer <lists at alteeve.ca> wrote:
> On 01/12/14 11:56 AM, Megan . wrote:
>>
>> Thank you for your replies.
>>
>> The cluster is intended to be 9 nodes, but i haven't finished building
>> the remaining 2.  Our production cluster is expected to be similar in
>> size.  What tuning should I be looking at?
>>
>>
>> Here is a link to our config.  http://pastebin.com/LUHM8GQR  I had to
>> remove IP addresses.
>
>
> Can you simplify those fencedevice definitions? I would wonder if the set
> timeouts could be part of the problem. Always start with the simplest
> possible configurations and only add options in response to actual issues
> discovered in testing.

I can try to simplify.  I had the longer timeouts because what I saw
happening on the physical boxes, was the box would be on its way
down/up and the fence command would fail, but the box actually did
come back online.  The physicals take 10-15 minutes to reboot and i
wasn't sure how to handle timeout issues, so i made the timeouts a bit
extreme for testing. I'll try to make the config more vanilla for
troubleshooting.


>> I tried the method of (echo c > /proc/sysrq-trigger) to crash a node,
>> the cluster kept seeing it as online and never fenced it, yet i could
>> no longer ssh to the node.  I did this on a physical and VM box with
>> the same result.  I had to fence_node node to get it to reboot, but it
>> came up split brained (thinking it was the only one online). Now that
>> node has cman down and the rest of the cluster sees it as still
>> online.
>
>
> Then corosync failed to detect the fault. That is a sign, to me, of a
> fundamental network or configuration issue. Corosync should have shown
> messages about a node being lost and reconfiguring. If that didn't happen,
> then you're not even up to the point where fencing factors in.
>
> Did you configure corosync.conf? When it came up, did it think it was
> quorate or inquorate?


corosync.conf didn't work since it seems the RedHat HA Cluster doesn't
use that file.  http://people.redhat.com/ccaulfie/docs/CmanYinYang.pdf
 I tried it since we wanted to try to put the multicast traffic on a
different bond/vlan but we figured out the file isn't used.


>> I thought fencing was working because i'm able to do fence_node node
>> and see the box reboot and come back online.  I did have to get the FC
>> version of the fence_agents because of an issue with the idrac agent
>> not working properly.  We are running fence-agents-3.1.6-1.fc14.x86_64
>
>
> That tells you that the configuration of the fence agents is working, but it
> doesn't test failure detection. You can use the 'fence_check' tool to see if
> the cluster can talk to everything, but in the end, the only useful test is
> to simulate an actual crash.
>
> Wait; 'fc14' ?! What OS are you using?
>
>

We are Centos 6.6.  I went with the fedora core agents because of this
exact issue http://forum.proxmox.com/threads/12311-Proxmox-HA-fencing-and-Dell-iDrac7
 I read that it was fixed in the next version, which i could only find
for FC.


>> fence_tool dump worked on one of my nodes, but it is just hanging on the
>> rest.
>>
>> [root at map1-uat ~]# fence_tool dump
>> 1417448610 logging mode 3 syslog f 160 p 6 logfile p 6
>> /var/log/cluster/fenced.log
>> 1417448610 fenced 3.0.12.1 started
>> 1417448610 connected to dbus :1.12
>> 1417448610 cluster node 1 added seq 89048
>> 1417448610 cluster node 2 added seq 89048
>> 1417448610 cluster node 3 added seq 89048
>> 1417448610 cluster node 4 added seq 89048
>> 1417448610 cluster node 5 added seq 89048
>> 1417448610 cluster node 6 added seq 89048
>> 1417448610 cluster node 8 added seq 89048
>> 1417448610 our_nodeid 4 our_name map1-uat.project.domain.com
>> 1417448611 logging mode 3 syslog f 160 p 6 logfile p 6
>> /var/log/cluster/fenced.log
>> 1417448611 logfile cur mode 100644
>> 1417448611 cpg_join fenced:daemon ...
>> 1417448621 daemon cpg_join error retrying
>> 1417448631 daemon cpg_join error retrying
>> 1417448641 daemon cpg_join error retrying
>> 1417448651 daemon cpg_join error retrying
>> 1417448661 daemon cpg_join error retrying
>> 1417448671 daemon cpg_join error retrying
>> 1417448681 daemon cpg_join error retrying
>> 1417448691 daemon cpg_join error retrying
>> .
>> .
>> .
>>
>>
>> [root at map1-uat ~]# clustat
>> Cluster Status for gibsuat @ Mon Dec  1 16:51:49 2014
>> Member Status: Quorate
>>
>>   Member Name                                                     ID
>> Status
>>   ------ ----                                                     ----
>> ------
>>   archive1-uat.project.domain.com                                1 Online
>>   admin1-uat.project.domain.com                                  2 Online
>>   mgmt1-uat.project.domain.com                                   3 Online
>>   map1-uat.project.domain.com                                    4 Online,
>> Local
>>   map2-uat.project.domain.com                                    5 Online
>>   cache1-uat.project.domain.com                                 6 Online
>>   data1-uat.project.domain.com                                   8 Online
>>
>>
>> The  /var/log/cluster/fenced.log on the nodes is saying Dec 01
>> 16:02:34 fenced cpg_join error retrying every 10th of a second.
>>
>> Obviously having some major issues.  These are fresh boxes, no other
>> services right now other then ones related to the cluster.
>
>
> What OS/version?
>
>> I've also experimented with the  <cman transport="udpu"/> to disable
>> multicast to see if that helped but it doesn't seem to make a
>> difference with the node stability.
>
>
> Very bad idea with >2~3 node clusters. The overhead will be far too great
> for a 7~9 node cluster.
>
>> Is there a document or some sort of reference that I can give the
>> network folks on how the switches should be configured?  I read stuff
>> on boards about IGMP snooping, but I couldn't find anything from
>> RedHat to hand them.
>
>
> I have this:
>
> https://alteeve.ca/w/AN!Cluster_Tutorial_2#Six_Network_Interfaces.2C_Seriously.3F
>
> https://alteeve.ca/w/AN!Cluster_Tutorial_2#Network_Switches
>
> https://alteeve.ca/w/AN!Cluster_Tutorial_2#Network_Security_Considerations
>
> https://alteeve.ca/w/AN!Cluster_Tutorial_2#Network
>
> There are comments in there about multicast, etc.
>

Thank you for the links.  I will review them with our network folks,
hopefully it will help us sort out some of our issues.

I will use the fence_check tool to see if i can troubleshoot the fencing.

Thank you very much for all of your suggestions.

>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From lists at alteeve.ca  Mon Dec  1 18:24:00 2014
From: lists at alteeve.ca (Digimer)
Date: Mon, 01 Dec 2014 13:24:00 -0500
Subject: [Linux-cluster] new cluster acting odd
In-Reply-To: <CACMA5-xVGQQUHj6-bdRQMtiHXp1T+X7XaYmGWnQ2KEPVip+i0w@mail.gmail.com>
References: <CACMA5-yx0Bu1_i97-Ws4bW8hGFcTpR+-nEwnzXc4YN7GxxmvCA@mail.gmail.com>	<547C8FDF.1070000@alteeve.ca>	<CACMA5-xzK4nF7JkfchKY732bzC+nZ3EJ9ga5Yp0NHxn5toEmqg@mail.gmail.com>	<547CA1EE.9030001@alteeve.ca>
	<CACMA5-xVGQQUHj6-bdRQMtiHXp1T+X7XaYmGWnQ2KEPVip+i0w@mail.gmail.com>
Message-ID: <547CB240.4030800@alteeve.ca>

On 01/12/14 01:03 PM, Megan . wrote:
> We have 11 10-20TB GFS2 mounts that I need to share across all nodes.
> Its the only reason we went with the cluster solution.  I don't know
> how we could split it up into different smaller clusters.

I would do this, personally;

2-Node cluster; DRBD (on top of local disks or a pair of SANs, one per 
node), exported over NFS and configured in a simple single-primary 
(master/slave) configuration with a floating IP.

GFS2, like any clustered filesystem, requires cluster locking. This 
locking comes with a non-trivial overhead. Exporting NFS allows you to 
avoid this bottle-neck and with a simple 2-node cluster behind the 
scenes, you maintain full HA.

In HA, nothing is more important than simplicity. Said another way;

"A cluster isn't beautiful when there is nothing left to add. It is 
beautiful when there is nothing left to take away."

> On Mon, Dec 1, 2014 at 12:14 PM, Digimer <lists at alteeve.ca> wrote:
>> On 01/12/14 11:56 AM, Megan . wrote:
>>>
>>> Thank you for your replies.
>>>
>>> The cluster is intended to be 9 nodes, but i haven't finished building
>>> the remaining 2.  Our production cluster is expected to be similar in
>>> size.  What tuning should I be looking at?
>>>
>>>
>>> Here is a link to our config.  http://pastebin.com/LUHM8GQR  I had to
>>> remove IP addresses.
>>
>>
>> Can you simplify those fencedevice definitions? I would wonder if the set
>> timeouts could be part of the problem. Always start with the simplest
>> possible configurations and only add options in response to actual issues
>> discovered in testing.
>
> I can try to simplify.  I had the longer timeouts because what I saw
> happening on the physical boxes, was the box would be on its way
> down/up and the fence command would fail, but the box actually did
> come back online.  The physicals take 10-15 minutes to reboot and i
> wasn't sure how to handle timeout issues, so i made the timeouts a bit
> extreme for testing. I'll try to make the config more vanilla for
> troubleshooting.

I'm not really sure why the state of the node should impact the fence 
action in any way. Fencing is supposed to work, regardless of the state 
of the target.

Fencing works like this (with a default config, on most fence agents);

1. Force off
2. Verify off
3. Try to boot, don't care if it succeeds.

So once the node is confirmed off by the agent, the fence is considered 
a success. How long (if at all) it takes for the node to reboot does not 
factor in.

>>> I tried the method of (echo c > /proc/sysrq-trigger) to crash a node,
>>> the cluster kept seeing it as online and never fenced it, yet i could
>>> no longer ssh to the node.  I did this on a physical and VM box with
>>> the same result.  I had to fence_node node to get it to reboot, but it
>>> came up split brained (thinking it was the only one online). Now that
>>> node has cman down and the rest of the cluster sees it as still
>>> online.
>>
>>
>> Then corosync failed to detect the fault. That is a sign, to me, of a
>> fundamental network or configuration issue. Corosync should have shown
>> messages about a node being lost and reconfiguring. If that didn't happen,
>> then you're not even up to the point where fencing factors in.
>>
>> Did you configure corosync.conf? When it came up, did it think it was
>> quorate or inquorate?
>
> corosync.conf didn't work since it seems the RedHat HA Cluster doesn't
> use that file.  http://people.redhat.com/ccaulfie/docs/CmanYinYang.pdf
>   I tried it since we wanted to try to put the multicast traffic on a
> different bond/vlan but we figured out the file isn't used.

Right, I wanted to make sure that, if you had tried, you've since 
removed the corosync.conf entirely. Corosync is fully controlled by the 
cman cluster.conf file.

>>> I thought fencing was working because i'm able to do fence_node node
>>> and see the box reboot and come back online.  I did have to get the FC
>>> version of the fence_agents because of an issue with the idrac agent
>>> not working properly.  We are running fence-agents-3.1.6-1.fc14.x86_64
>>
>>
>> That tells you that the configuration of the fence agents is working, but it
>> doesn't test failure detection. You can use the 'fence_check' tool to see if
>> the cluster can talk to everything, but in the end, the only useful test is
>> to simulate an actual crash.
>>
>> Wait; 'fc14' ?! What OS are you using?
>
> We are Centos 6.6.  I went with the fedora core agents because of this
> exact issue http://forum.proxmox.com/threads/12311-Proxmox-HA-fencing-and-Dell-iDrac7
>   I read that it was fixed in the next version, which i could only find
> for FC.

It would be *much* better to file a bug report 
(https://bugzilla.redhat.com/enter_bug.cgi?product=Red%20Hat%20Enterprise%20Linux%206) 
-> Version: 6.6 -> Component: fence-agents

Mixing RPMs from other OSes is not a good idea at all.

>>> fence_tool dump worked on one of my nodes, but it is just hanging on the
>>> rest.
>>>
>>> [root at map1-uat ~]# fence_tool dump
>>> 1417448610 logging mode 3 syslog f 160 p 6 logfile p 6
>>> /var/log/cluster/fenced.log
>>> 1417448610 fenced 3.0.12.1 started
>>> 1417448610 connected to dbus :1.12
>>> 1417448610 cluster node 1 added seq 89048
>>> 1417448610 cluster node 2 added seq 89048
>>> 1417448610 cluster node 3 added seq 89048
>>> 1417448610 cluster node 4 added seq 89048
>>> 1417448610 cluster node 5 added seq 89048
>>> 1417448610 cluster node 6 added seq 89048
>>> 1417448610 cluster node 8 added seq 89048
>>> 1417448610 our_nodeid 4 our_name map1-uat.project.domain.com
>>> 1417448611 logging mode 3 syslog f 160 p 6 logfile p 6
>>> /var/log/cluster/fenced.log
>>> 1417448611 logfile cur mode 100644
>>> 1417448611 cpg_join fenced:daemon ...
>>> 1417448621 daemon cpg_join error retrying
>>> 1417448631 daemon cpg_join error retrying
>>> 1417448641 daemon cpg_join error retrying
>>> 1417448651 daemon cpg_join error retrying
>>> 1417448661 daemon cpg_join error retrying
>>> 1417448671 daemon cpg_join error retrying
>>> 1417448681 daemon cpg_join error retrying
>>> 1417448691 daemon cpg_join error retrying
>>> .
>>> .
>>> .
>>>
>>>
>>> [root at map1-uat ~]# clustat
>>> Cluster Status for gibsuat @ Mon Dec  1 16:51:49 2014
>>> Member Status: Quorate
>>>
>>>    Member Name                                                     ID
>>> Status
>>>    ------ ----                                                     ----
>>> ------
>>>    archive1-uat.project.domain.com                                1 Online
>>>    admin1-uat.project.domain.com                                  2 Online
>>>    mgmt1-uat.project.domain.com                                   3 Online
>>>    map1-uat.project.domain.com                                    4 Online,
>>> Local
>>>    map2-uat.project.domain.com                                    5 Online
>>>    cache1-uat.project.domain.com                                 6 Online
>>>    data1-uat.project.domain.com                                   8 Online
>>>
>>>
>>> The  /var/log/cluster/fenced.log on the nodes is saying Dec 01
>>> 16:02:34 fenced cpg_join error retrying every 10th of a second.
>>>
>>> Obviously having some major issues.  These are fresh boxes, no other
>>> services right now other then ones related to the cluster.
>>
>>
>> What OS/version?
>>
>>> I've also experimented with the  <cman transport="udpu"/> to disable
>>> multicast to see if that helped but it doesn't seem to make a
>>> difference with the node stability.
>>
>>
>> Very bad idea with >2~3 node clusters. The overhead will be far too great
>> for a 7~9 node cluster.
>>
>>> Is there a document or some sort of reference that I can give the
>>> network folks on how the switches should be configured?  I read stuff
>>> on boards about IGMP snooping, but I couldn't find anything from
>>> RedHat to hand them.
>>
>>
>> I have this:
>>
>> https://alteeve.ca/w/AN!Cluster_Tutorial_2#Six_Network_Interfaces.2C_Seriously.3F
>>
>> https://alteeve.ca/w/AN!Cluster_Tutorial_2#Network_Switches
>>
>> https://alteeve.ca/w/AN!Cluster_Tutorial_2#Network_Security_Considerations
>>
>> https://alteeve.ca/w/AN!Cluster_Tutorial_2#Network
>>
>> There are comments in there about multicast, etc.
>>
>
> Thank you for the links.  I will review them with our network folks,
> hopefully it will help us sort out some of our issues.
>
> I will use the fence_check tool to see if i can troubleshoot the fencing.
>
> Thank you very much for all of your suggestions.

Happy to help. :)

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


From dan131riley at gmail.com  Mon Dec  1 20:14:40 2014
From: dan131riley at gmail.com (Dan Riley)
Date: Mon, 1 Dec 2014 15:14:40 -0500
Subject: [Linux-cluster] new cluster acting odd
In-Reply-To: <547CB240.4030800@alteeve.ca>
References: <CACMA5-yx0Bu1_i97-Ws4bW8hGFcTpR+-nEwnzXc4YN7GxxmvCA@mail.gmail.com>
	<547C8FDF.1070000@alteeve.ca>
	<CACMA5-xzK4nF7JkfchKY732bzC+nZ3EJ9ga5Yp0NHxn5toEmqg@mail.gmail.com>
	<547CA1EE.9030001@alteeve.ca>
	<CACMA5-xVGQQUHj6-bdRQMtiHXp1T+X7XaYmGWnQ2KEPVip+i0w@mail.gmail.com>
	<547CB240.4030800@alteeve.ca>
Message-ID: <734DC560-2D19-4C74-BD88-6E82C0D210FF@gmail.com>


> On Dec 1, 2014, at 13:24, Digimer <lists at alteeve.ca> wrote:
> GFS2, like any clustered filesystem, requires cluster locking. This locking comes with a non-trivial overhead. Exporting NFS allows you to avoid this bottle-neck and with a simple 2-node cluster behind the scenes, you maintain full HA.

We have a few small GFS2 file systems and one largish (2TB) one.  The small ones are fine, the large one is a pain.  We're in the process of converting the large one to XFS with NFS (the backend for this is all iSCSI devices).  For our application, NFSv4 makes this possible, as it provides much better consistency properties than the previous versions.

>>> On 01/12/14 11:56 AM, Megan . wrote:
>>>> fence_tool dump worked on one of my nodes, but it is just hanging on the
>>>> rest.

Well, that's not good.  I'd have to look at the fence tool source to even figure out how it could be blocking.

>>>> 
>>>> [root at map1-uat ~]# fence_tool dump
>>>> 1417448610 logging mode 3 syslog f 160 p 6 logfile p 6
>>>> /var/log/cluster/fenced.log
>>>> 1417448610 fenced 3.0.12.1 started
>>>> 1417448610 connected to dbus :1.12
>>>> 1417448610 cluster node 1 added seq 89048
>>>> 1417448610 cluster node 2 added seq 89048
>>>> 1417448610 cluster node 3 added seq 89048
>>>> 1417448610 cluster node 4 added seq 89048
>>>> 1417448610 cluster node 5 added seq 89048
>>>> 1417448610 cluster node 6 added seq 89048
>>>> 1417448610 cluster node 8 added seq 89048
>>>> 1417448610 our_nodeid 4 our_name map1-uat.project.domain.com
>>>> 1417448611 logging mode 3 syslog f 160 p 6 logfile p 6
>>>> /var/log/cluster/fenced.log
>>>> 1417448611 logfile cur mode 100644
>>>> 1417448611 cpg_join fenced:daemon ...
>>>> 1417448621 daemon cpg_join error retrying
>>>> 1417448631 daemon cpg_join error retrying
>>>> 1417448641 daemon cpg_join error retrying
>>>> 1417448651 daemon cpg_join error retrying
>>>> 1417448661 daemon cpg_join error retrying
>>>> 1417448671 daemon cpg_join error retrying
>>>> 1417448681 daemon cpg_join error retrying
>>>> 1417448691 daemon cpg_join error retrying

And that looks the fence group is failing the membership transition.  Nothing else will work properly if the fence group is busted.

-dan


From mgrac at redhat.com  Mon Dec  1 20:17:35 2014
From: mgrac at redhat.com (Marek "marx" Grac)
Date: Mon, 01 Dec 2014 21:17:35 +0100
Subject: [Linux-cluster] fence-agents-4.0.13 stable release
Message-ID: <547CCCDF.2010107@redhat.com>

Welcome to the fence-agents 4.0.13 release

This release includes some new features and several bugfixes:

* new fence agent based on mpathpersist that offers better handling of 
multipath devices
* improve support of fence_ilo_ssh on older firmwares

* required packages are also required by autoconf during build time
* fence_zvm now supports action 'monitor' (thanks to Neale Ferguson)
* introduce --gnutlscli-path --sudo-path --ssh-path and --telnet-path; 
they are no longer hard-coded
* order of XML parameters or options in --help is more stable now
* fence_cisco_ucs did not set protocol prefix correctly with 
--ssl-(in)secure
* logging to syslog now works correctly also with \x00 as input
<https://git.fedorahosted.org/cgit/fence-agents.git/commit/?id=7a04e60e463888fd1a6737cfbb636fbcf8de2d3a>
The new source tarball can be downloaded here:

https://fedorahosted.org/releases/f/e/fence-agents/fence-agents-4.0.13.tar.xz 


To report bugs or issues:

https://bugzilla.redhat.com/

Would you like to meet the cluster team or members of its community?

    Join us on IRC (irc.freenode.net #linux-cluster) and share your
    experience  with other sysadministrators or power users.

Thanks/congratulations to all people that contributed to achieve this
great milestone.

m,
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20141201/e1201768/attachment.htm>

From ccaulfie at redhat.com  Tue Dec  2 08:46:29 2014
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Tue, 02 Dec 2014 08:46:29 +0000
Subject: [Linux-cluster] new cluster acting odd
In-Reply-To: <CACMA5-yx0Bu1_i97-Ws4bW8hGFcTpR+-nEwnzXc4YN7GxxmvCA@mail.gmail.com>
References: <CACMA5-yx0Bu1_i97-Ws4bW8hGFcTpR+-nEwnzXc4YN7GxxmvCA@mail.gmail.com>
Message-ID: <547D7C65.8010104@redhat.com>

On 01/12/14 14:16, Megan . wrote:
> Good Day,
>
> I'm fairly new to the cluster world so i apologize in advance for
> silly questions.  Thank you for any help.
>
> We decided to use this cluster solution in order to share GFS2 mounts
> across servers.  We have a 7 node cluster that is newly setup, but
> acting oddly.  It has 3 vmware guest hosts and 4 physical hosts (dells
> with Idracs).  They are all running Centos 6.6.  I have fencing
> working (I'm able to do fence_node node and it will fence with
> success).  I do not have the gfs2 mounts in the cluster yet.
>
> When I don't touch the servers, my cluster looks perfect with all
> nodes online. But when I start testing fencing, I have an odd problem
> where i end up with split brain between some of the nodes.  They won't
> seem to automatically fence each other when it gets like this.
>
> in the  corosync.log for the node that gets split out i see the totem
> chatter, but it seems confused and just keeps doing the below over and
> over:
>
>
> Dec 01 12:39:15 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a 2b 2c
>
> Dec 01 12:39:17 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a 2b 2c
>
> Dec 01 12:39:19 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a 2b 2c
>
> Dec 01 12:39:39 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
>
> Dec 01 12:39:39 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
> 21 23 24 25 26 27 28 29 2a 2b 32
> ..
> ..
> ..
> Dec 01 12:54:49 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
> 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c
>
> Dec 01 12:54:50 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
> 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c
>
> Dec 01 12:54:50 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
> 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c
>

These messages are the key to your problem and nothing will be fixed 
until you can get rid of them. As Digimer said they are often caused by 
a congested network, but it could also be multicast traffic not being 
passed between nodes - a mix of physical and virtual nodes could easily 
be contributing to this. The easiest way to prove this (and get the 
system working possibly) is to switch from multicast to normal UDP 
unicast traffic

<cman transport="udpu"/>

in cluster.conf. You'll need to to this on all nodes and reboot the 
whole cluster. All in all, it's probably easier that messing around 
checking routers, switches and kernel routing paramaters in a 
mixed-mode cluster!

Chrissie


From info at innova-studios.com  Tue Dec  2 10:29:04 2014
From: info at innova-studios.com (=?iso-8859-1?Q?J=FCrgen_Ladst=E4tter?=)
Date: Tue, 2 Dec 2014 11:29:04 +0100
Subject: [Linux-cluster] Fencing and dead locks
Message-ID: <065601d00e1a$cc3ca750$64b5f5f0$@innova-studios.com>

Hi guys,

 
we?re running a 9 node cluster with 5 gfs2 mounts. The cluster is mainly
used for load balancing web based applications. Fencing is done with IPMI
and works.

Sometimes one server gets fenced, but after rebooting isn?t able to rejoin
the cluster. This triggers higher load and many open processes, leading to
another server being fenced. This server then isn?t able to rejoin either
and this continues until we lose quorum and have to manually restart the
whole cluster.

Sadly this is not reproducible, but it looks like it happens more often when
there is more write IO.

 
Since a whole cluster deadlock kinda removes the sense of a cluster, we?d
need some input what we could do or change.

We?re running Centos 6.6, kernel 2.6.32-504.1.3.el6.x86_64

 
Did anyone of you test gfs2 with centos 7? Any known major bugs that could
cause dead locks? 

 
Thanks in advance, J?rgen

 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20141202/af810cb9/attachment.htm>

From nagemnna at gmail.com  Tue Dec  2 14:04:18 2014
From: nagemnna at gmail.com (Megan .)
Date: Tue, 2 Dec 2014 09:04:18 -0500
Subject: [Linux-cluster] new cluster acting odd
In-Reply-To: <547D7C65.8010104@redhat.com>
References: <CACMA5-yx0Bu1_i97-Ws4bW8hGFcTpR+-nEwnzXc4YN7GxxmvCA@mail.gmail.com>
	<547D7C65.8010104@redhat.com>
Message-ID: <CACMA5-wpRJdgqv70CpWFfXmj0=4jo8xJDEK1+JHf5S9Rd-evqw@mail.gmail.com>

Ok, thank you.

I did try this at one point and it didn't seem to have an impact.  but
I will try again and try some of the debugging commands provided by
others in this thread.

Thank you again for your help.


On Tue, Dec 2, 2014 at 3:46 AM, Christine Caulfield <ccaulfie at redhat.com> wrote:
> On 01/12/14 14:16, Megan . wrote:
>>
>> Good Day,
>>
>> I'm fairly new to the cluster world so i apologize in advance for
>> silly questions.  Thank you for any help.
>>
>> We decided to use this cluster solution in order to share GFS2 mounts
>> across servers.  We have a 7 node cluster that is newly setup, but
>> acting oddly.  It has 3 vmware guest hosts and 4 physical hosts (dells
>> with Idracs).  They are all running Centos 6.6.  I have fencing
>> working (I'm able to do fence_node node and it will fence with
>> success).  I do not have the gfs2 mounts in the cluster yet.
>>
>> When I don't touch the servers, my cluster looks perfect with all
>> nodes online. But when I start testing fencing, I have an odd problem
>> where i end up with split brain between some of the nodes.  They won't
>> seem to automatically fence each other when it gets like this.
>>
>> in the  corosync.log for the node that gets split out i see the totem
>> chatter, but it seems confused and just keeps doing the below over and
>> over:
>>
>>
>> Dec 01 12:39:15 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a
>> 2b 2c
>>
>> Dec 01 12:39:17 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a
>> 2b 2c
>>
>> Dec 01 12:39:19 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a
>> 2b 2c
>>
>> Dec 01 12:39:39 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
>>
>> Dec 01 12:39:39 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
>> 21 23 24 25 26 27 28 29 2a 2b 32
>> ..
>> ..
>> ..
>> Dec 01 12:54:49 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
>> 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c
>>
>> Dec 01 12:54:50 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
>> 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c
>>
>> Dec 01 12:54:50 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
>> 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c
>>
>
> These messages are the key to your problem and nothing will be fixed until
> you can get rid of them. As Digimer said they are often caused by a
> congested network, but it could also be multicast traffic not being passed
> between nodes - a mix of physical and virtual nodes could easily be
> contributing to this. The easiest way to prove this (and get the system
> working possibly) is to switch from multicast to normal UDP unicast traffic
>
> <cman transport="udpu"/>
>
> in cluster.conf. You'll need to to this on all nodes and reboot the whole
> cluster. All in all, it's probably easier that messing around checking
> routers, switches and kernel routing paramaters in a mixed-mode cluster!
>
> Chrissie
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster


From jpokorny at redhat.com  Mon Dec  8 13:36:08 2014
From: jpokorny at redhat.com (Jan =?utf-8?Q?Pokorn=C3=BD?=)
Date: Mon, 8 Dec 2014 14:36:08 +0100
Subject: [Linux-cluster] [Pacemaker] [RFC] Organizing HA Summit 2015
In-Reply-To: <540D853F.3090109@redhat.com>
References: <540D853F.3090109@redhat.com>
Message-ID: <20141208133608.GC18879@redhat.com>

Hello,

it occured to me that if you want to use the opportunity and double
as as tourist while being in Brno, it's about the right time to
consider reservations/ticket purchases this early.
At least in some cases it is a must, e.g., Villa Tugendhat:

http://rezervace.spilberk.cz/langchange.aspx?mrsname=&languageId=2&returnUrl=%2Flist

On 08/09/14 12:30 +0200, Fabio M. Di Nitto wrote:
> DevConf will start Friday the 6th of Feb 2015 in Red Hat Brno offices.
> 
> My suggestion would be to have a 2 days dedicated HA summit the 4th and
> the 5th of February.

-- 
Jan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 819 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20141208/2a4397a2/attachment.sig>

From v.melnik at uplink.ua  Sat Dec 13 16:04:48 2014
From: v.melnik at uplink.ua (Vladimir Melnik)
Date: Sat, 13 Dec 2014 18:04:48 +0200
Subject: [Linux-cluster] The file on a GFS2-filesystem seems to be corrupted
Message-ID: <20141213160449.GS23175@shagomer.uplink.net.ua>

Dear colleagues,

I encountered some very strange issue and would be grateful if you share
your thoughts on that.

I have a qcow2-image that is located at gfs2 filesystem on a cluster.
The cluster works fine and there are dozens of other qcow2-images, but,
as I can see, one of images seems to be corrupted.

First of all, it has quite unusual size:
> stat /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak
  File: `/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak'
  Size: 7493992262336241664     Blocks: 821710640  IO Block: 4096   regular file
Device: fd06h/64774d    Inode: 220986752   Links: 1
Access: (0744/-rwxr--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2014-10-09 16:25:24.864877839 +0300
Modify: 2014-12-13 14:41:29.335603509 +0200
Change: 2014-12-13 15:52:35.986888549 +0200

By the way, I noticed that blocks' number looks rather okay.

Also qemu-img can't recognize it as an image:
> qemu-img info /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak
image: /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak
file format: raw
virtual size: 6815746T (7493992262336241664 bytes)
disk size: 392G

Disk size, although, looks more reasonable: the image's size is really
should be about 300-400G, as I remember.

Alas, I can't do anything with this image. I can't check it by qemu-img,
neither I can convert it to the new image, as qemu-img can't do anything
with it:

> qemu-img convert -p -f qcow2 -O qcow2 /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak /mnt/tmp/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0
Could not open '/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak': Invalid argument
Could not open '/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak'

Any one have experienced the same issue? What do you think, is it qcow2
issue or a gfs2 issue? What would you do in similar situation?

Any ideas, hints and comments would be greatly appreciated.

Yes, I have snapshots, that's good, but wouldn't like to lose today's
changes to the data on that image. And I'm worried about the filesystem
at all: what if something goes wrong if I try to remove that file?

Thanks to all!


-- 
V.Melnik

P.S. I use CentOS-6 and I have these packages installed:
	qemu-img-0.12.1.2-2.415.el6_5.4.x86_64
	gfs2-utils-3.0.12.1-59.el6_5.1.x86_64
	lvm2-cluster-2.02.100-8.el6.x86_64
	cman-3.0.12.1-59.el6_5.1.x86_64
	clusterlib-3.0.12.1-59.el6_5.1.x86_64
	kernel-2.6.32-431.5.1.el6.x86_64


From v.melnik at uplink.ua  Mon Dec 15 09:17:32 2014
From: v.melnik at uplink.ua (Vladimir Melnik)
Date: Mon, 15 Dec 2014 11:17:32 +0200
Subject: [Linux-cluster] The file on a GFS2-filesystem seems to be
 corrupted
In-Reply-To: <20141213160449.GS23175@shagomer.uplink.net.ua>
References: <20141213160449.GS23175@shagomer.uplink.net.ua>
Message-ID: <20141215091732.GA15853@shagomer.uplink.net.ua>

And one more question,

Is it safe to remove this file? What will happen if I try to run 'rm
/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak', won't it corrupt
other files?

Thanks.

On Sat, Dec 13, 2014 at 06:04:48PM +0200, Vladimir Melnik wrote:
> Dear colleagues,
> 
> I encountered some very strange issue and would be grateful if you share
> your thoughts on that.
> 
> I have a qcow2-image that is located at gfs2 filesystem on a cluster.
> The cluster works fine and there are dozens of other qcow2-images, but,
> as I can see, one of images seems to be corrupted.
> 
> First of all, it has quite unusual size:
> > stat /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak
>   File: `/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak'
>   Size: 7493992262336241664     Blocks: 821710640  IO Block: 4096   regular file
> Device: fd06h/64774d    Inode: 220986752   Links: 1
> Access: (0744/-rwxr--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2014-10-09 16:25:24.864877839 +0300
> Modify: 2014-12-13 14:41:29.335603509 +0200
> Change: 2014-12-13 15:52:35.986888549 +0200
> 
> By the way, I noticed that blocks' number looks rather okay.
> 
> Also qemu-img can't recognize it as an image:
> > qemu-img info /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak
> image: /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak
> file format: raw
> virtual size: 6815746T (7493992262336241664 bytes)
> disk size: 392G
> 
> Disk size, although, looks more reasonable: the image's size is really
> should be about 300-400G, as I remember.
> 
> Alas, I can't do anything with this image. I can't check it by qemu-img,
> neither I can convert it to the new image, as qemu-img can't do anything
> with it:
> 
> > qemu-img convert -p -f qcow2 -O qcow2 /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak /mnt/tmp/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0
> Could not open '/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak': Invalid argument
> Could not open '/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak'
> 
> Any one have experienced the same issue? What do you think, is it qcow2
> issue or a gfs2 issue? What would you do in similar situation?
> 
> Any ideas, hints and comments would be greatly appreciated.
> 
> Yes, I have snapshots, that's good, but wouldn't like to lose today's
> changes to the data on that image. And I'm worried about the filesystem
> at all: what if something goes wrong if I try to remove that file?
> 
> Thanks to all!
> 
> 
> -- 
> V.Melnik
> 
> P.S. I use CentOS-6 and I have these packages installed:
> 	qemu-img-0.12.1.2-2.415.el6_5.4.x86_64
> 	gfs2-utils-3.0.12.1-59.el6_5.1.x86_64
> 	lvm2-cluster-2.02.100-8.el6.x86_64
> 	cman-3.0.12.1-59.el6_5.1.x86_64
> 	clusterlib-3.0.12.1-59.el6_5.1.x86_64
> 	kernel-2.6.32-431.5.1.el6.x86_64
> 
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-- 
V.Melnik


From swhiteho at redhat.com  Mon Dec 15 09:23:47 2014
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Mon, 15 Dec 2014 09:23:47 +0000
Subject: [Linux-cluster] The file on a GFS2-filesystem seems to be
	corrupted
In-Reply-To: <20141215091732.GA15853@shagomer.uplink.net.ua>
References: <20141213160449.GS23175@shagomer.uplink.net.ua>
	<20141215091732.GA15853@shagomer.uplink.net.ua>
Message-ID: <548EA8A3.3000000@redhat.com>

Hi,

How did you generate the image in the first place? I don't know if we've 
ever really tested GFS2 with a qcow device underneath it - normally even 
in virt clusters the storage for GFS2 would be a real shared block 
device. Was this perhaps just a single node?

Have you checked the image with fsck.gfs2 ?

Steve.

On 15/12/14 09:17, Vladimir Melnik wrote:
> And one more question,
>
> Is it safe to remove this file? What will happen if I try to run 'rm
> /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak', won't it corrupt
> other files?
>
> Thanks.
>
> On Sat, Dec 13, 2014 at 06:04:48PM +0200, Vladimir Melnik wrote:
>> Dear colleagues,
>>
>> I encountered some very strange issue and would be grateful if you share
>> your thoughts on that.
>>
>> I have a qcow2-image that is located at gfs2 filesystem on a cluster.
>> The cluster works fine and there are dozens of other qcow2-images, but,
>> as I can see, one of images seems to be corrupted.
>>
>> First of all, it has quite unusual size:
>>> stat /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak
>>    File: `/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak'
>>    Size: 7493992262336241664     Blocks: 821710640  IO Block: 4096   regular file
>> Device: fd06h/64774d    Inode: 220986752   Links: 1
>> Access: (0744/-rwxr--r--)  Uid: (    0/    root)   Gid: (    0/    root)
>> Access: 2014-10-09 16:25:24.864877839 +0300
>> Modify: 2014-12-13 14:41:29.335603509 +0200
>> Change: 2014-12-13 15:52:35.986888549 +0200
>>
>> By the way, I noticed that blocks' number looks rather okay.
>>
>> Also qemu-img can't recognize it as an image:
>>> qemu-img info /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak
>> image: /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak
>> file format: raw
>> virtual size: 6815746T (7493992262336241664 bytes)
>> disk size: 392G
>>
>> Disk size, although, looks more reasonable: the image's size is really
>> should be about 300-400G, as I remember.
>>
>> Alas, I can't do anything with this image. I can't check it by qemu-img,
>> neither I can convert it to the new image, as qemu-img can't do anything
>> with it:
>>
>>> qemu-img convert -p -f qcow2 -O qcow2 /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak /mnt/tmp/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0
>> Could not open '/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak': Invalid argument
>> Could not open '/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak'
>>
>> Any one have experienced the same issue? What do you think, is it qcow2
>> issue or a gfs2 issue? What would you do in similar situation?
>>
>> Any ideas, hints and comments would be greatly appreciated.
>>
>> Yes, I have snapshots, that's good, but wouldn't like to lose today's
>> changes to the data on that image. And I'm worried about the filesystem
>> at all: what if something goes wrong if I try to remove that file?
>>
>> Thanks to all!
>>
>>
>> -- 
>> V.Melnik
>>
>> P.S. I use CentOS-6 and I have these packages installed:
>> 	qemu-img-0.12.1.2-2.415.el6_5.4.x86_64
>> 	gfs2-utils-3.0.12.1-59.el6_5.1.x86_64
>> 	lvm2-cluster-2.02.100-8.el6.x86_64
>> 	cman-3.0.12.1-59.el6_5.1.x86_64
>> 	clusterlib-3.0.12.1-59.el6_5.1.x86_64
>> 	kernel-2.6.32-431.5.1.el6.x86_64
>>
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster


From v.melnik at uplink.ua  Mon Dec 15 09:54:42 2014
From: v.melnik at uplink.ua (Vladimir Melnik)
Date: Mon, 15 Dec 2014 11:54:42 +0200
Subject: [Linux-cluster] The file on a GFS2-filesystem seems to be
 corrupted
In-Reply-To: <548EA8A3.3000000@redhat.com>
References: <20141213160449.GS23175@shagomer.uplink.net.ua>
	<20141215091732.GA15853@shagomer.uplink.net.ua>
	<548EA8A3.3000000@redhat.com>
Message-ID: <20141215095442.GB15853@shagomer.uplink.net.ua>

Hi!

The qcow2 isn't inderneath it, we can assume it's an ordinary file on a
filesystem. Its' size was about 300-400 GB, but now size is
7493992262336241664 bytes and I don't understand how it's happened. I'd
like to remove it, but I worry about consequences. :(

On Mon, Dec 15, 2014 at 09:23:47AM +0000, Steven Whitehouse wrote:
> Hi,
> 
> How did you generate the image in the first place? I don't know if
> we've ever really tested GFS2 with a qcow device underneath it -
> normally even in virt clusters the storage for GFS2 would be a real
> shared block device. Was this perhaps just a single node?
> 
> Have you checked the image with fsck.gfs2 ?
> 
> Steve.
> 
> On 15/12/14 09:17, Vladimir Melnik wrote:
> >And one more question,
> >
> >Is it safe to remove this file? What will happen if I try to run 'rm
> >/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak', won't it corrupt
> >other files?
> >
> >Thanks.
> >
> >On Sat, Dec 13, 2014 at 06:04:48PM +0200, Vladimir Melnik wrote:
> >>Dear colleagues,
> >>
> >>I encountered some very strange issue and would be grateful if you share
> >>your thoughts on that.
> >>
> >>I have a qcow2-image that is located at gfs2 filesystem on a cluster.
> >>The cluster works fine and there are dozens of other qcow2-images, but,
> >>as I can see, one of images seems to be corrupted.
> >>
> >>First of all, it has quite unusual size:
> >>>stat /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak
> >>   File: `/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak'
> >>   Size: 7493992262336241664     Blocks: 821710640  IO Block: 4096   regular file
> >>Device: fd06h/64774d    Inode: 220986752   Links: 1
> >>Access: (0744/-rwxr--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> >>Access: 2014-10-09 16:25:24.864877839 +0300
> >>Modify: 2014-12-13 14:41:29.335603509 +0200
> >>Change: 2014-12-13 15:52:35.986888549 +0200
> >>
> >>By the way, I noticed that blocks' number looks rather okay.
> >>
> >>Also qemu-img can't recognize it as an image:
> >>>qemu-img info /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak
> >>image: /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak
> >>file format: raw
> >>virtual size: 6815746T (7493992262336241664 bytes)
> >>disk size: 392G
> >>
> >>Disk size, although, looks more reasonable: the image's size is really
> >>should be about 300-400G, as I remember.
> >>
> >>Alas, I can't do anything with this image. I can't check it by qemu-img,
> >>neither I can convert it to the new image, as qemu-img can't do anything
> >>with it:
> >>
> >>>qemu-img convert -p -f qcow2 -O qcow2 /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak /mnt/tmp/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0
> >>Could not open '/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak': Invalid argument
> >>Could not open '/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak'
> >>
> >>Any one have experienced the same issue? What do you think, is it qcow2
> >>issue or a gfs2 issue? What would you do in similar situation?
> >>
> >>Any ideas, hints and comments would be greatly appreciated.
> >>
> >>Yes, I have snapshots, that's good, but wouldn't like to lose today's
> >>changes to the data on that image. And I'm worried about the filesystem
> >>at all: what if something goes wrong if I try to remove that file?
> >>
> >>Thanks to all!
> >>
> >>
> >>-- 
> >>V.Melnik
> >>
> >>P.S. I use CentOS-6 and I have these packages installed:
> >>	qemu-img-0.12.1.2-2.415.el6_5.4.x86_64
> >>	gfs2-utils-3.0.12.1-59.el6_5.1.x86_64
> >>	lvm2-cluster-2.02.100-8.el6.x86_64
> >>	cman-3.0.12.1-59.el6_5.1.x86_64
> >>	clusterlib-3.0.12.1-59.el6_5.1.x86_64
> >>	kernel-2.6.32-431.5.1.el6.x86_64
> >>
> >>-- 
> >>Linux-cluster mailing list
> >>Linux-cluster at redhat.com
> >>https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-- 
V.Melnik


From swhiteho at redhat.com  Mon Dec 15 09:59:02 2014
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Mon, 15 Dec 2014 09:59:02 +0000
Subject: [Linux-cluster] The file on a GFS2-filesystem seems to be
	corrupted
In-Reply-To: <20141215095442.GB15853@shagomer.uplink.net.ua>
References: <20141213160449.GS23175@shagomer.uplink.net.ua>	<20141215091732.GA15853@shagomer.uplink.net.ua>	<548EA8A3.3000000@redhat.com>
	<20141215095442.GB15853@shagomer.uplink.net.ua>
Message-ID: <548EB0E6.3050706@redhat.com>

Hi,

On 15/12/14 09:54, Vladimir Melnik wrote:
> Hi!
>
> The qcow2 isn't inderneath it, we can assume it's an ordinary file on a
> filesystem. Its' size was about 300-400 GB, but now size is
> 7493992262336241664 bytes and I don't understand how it's happened. I'd
> like to remove it, but I worry about consequences. :(
Ok, I think I see now... either way though, if you are unsure about 
whether there is a problem, then unmounting on all nodes and running 
fsck is the way to go. That should pick up any problems that there might 
be with the filesystem. If you have the ability to snapshot the storage, 
then you could run fsck on a snapshot in order to avoid so much downtime.

An odd file size should not, in and of itself cause any problems with 
removing a file, so it will only be an issue if other on-disk metadata 
is incorrect,

Steve.

> On Mon, Dec 15, 2014 at 09:23:47AM +0000, Steven Whitehouse wrote:
>> Hi,
>>
>> How did you generate the image in the first place? I don't know if
>> we've ever really tested GFS2 with a qcow device underneath it -
>> normally even in virt clusters the storage for GFS2 would be a real
>> shared block device. Was this perhaps just a single node?
>>
>> Have you checked the image with fsck.gfs2 ?
>>
>> Steve.
>>
>> On 15/12/14 09:17, Vladimir Melnik wrote:
>>> And one more question,
>>>
>>> Is it safe to remove this file? What will happen if I try to run 'rm
>>> /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak', won't it corrupt
>>> other files?
>>>
>>> Thanks.
>>>
>>> On Sat, Dec 13, 2014 at 06:04:48PM +0200, Vladimir Melnik wrote:
>>>> Dear colleagues,
>>>>
>>>> I encountered some very strange issue and would be grateful if you share
>>>> your thoughts on that.
>>>>
>>>> I have a qcow2-image that is located at gfs2 filesystem on a cluster.
>>>> The cluster works fine and there are dozens of other qcow2-images, but,
>>>> as I can see, one of images seems to be corrupted.
>>>>
>>>> First of all, it has quite unusual size:
>>>>> stat /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak
>>>>    File: `/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak'
>>>>    Size: 7493992262336241664     Blocks: 821710640  IO Block: 4096   regular file
>>>> Device: fd06h/64774d    Inode: 220986752   Links: 1
>>>> Access: (0744/-rwxr--r--)  Uid: (    0/    root)   Gid: (    0/    root)
>>>> Access: 2014-10-09 16:25:24.864877839 +0300
>>>> Modify: 2014-12-13 14:41:29.335603509 +0200
>>>> Change: 2014-12-13 15:52:35.986888549 +0200
>>>>
>>>> By the way, I noticed that blocks' number looks rather okay.
>>>>
>>>> Also qemu-img can't recognize it as an image:
>>>>> qemu-img info /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak
>>>> image: /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak
>>>> file format: raw
>>>> virtual size: 6815746T (7493992262336241664 bytes)
>>>> disk size: 392G
>>>>
>>>> Disk size, although, looks more reasonable: the image's size is really
>>>> should be about 300-400G, as I remember.
>>>>
>>>> Alas, I can't do anything with this image. I can't check it by qemu-img,
>>>> neither I can convert it to the new image, as qemu-img can't do anything
>>>> with it:
>>>>
>>>>> qemu-img convert -p -f qcow2 -O qcow2 /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak /mnt/tmp/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0
>>>> Could not open '/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak': Invalid argument
>>>> Could not open '/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak'
>>>>
>>>> Any one have experienced the same issue? What do you think, is it qcow2
>>>> issue or a gfs2 issue? What would you do in similar situation?
>>>>
>>>> Any ideas, hints and comments would be greatly appreciated.
>>>>
>>>> Yes, I have snapshots, that's good, but wouldn't like to lose today's
>>>> changes to the data on that image. And I'm worried about the filesystem
>>>> at all: what if something goes wrong if I try to remove that file?
>>>>
>>>> Thanks to all!
>>>>
>>>>
>>>> -- 
>>>> V.Melnik
>>>>
>>>> P.S. I use CentOS-6 and I have these packages installed:
>>>> 	qemu-img-0.12.1.2-2.415.el6_5.4.x86_64
>>>> 	gfs2-utils-3.0.12.1-59.el6_5.1.x86_64
>>>> 	lvm2-cluster-2.02.100-8.el6.x86_64
>>>> 	cman-3.0.12.1-59.el6_5.1.x86_64
>>>> 	clusterlib-3.0.12.1-59.el6_5.1.x86_64
>>>> 	kernel-2.6.32-431.5.1.el6.x86_64
>>>>
>>>> -- 
>>>> Linux-cluster mailing list
>>>> Linux-cluster at redhat.com
>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster


From v.melnik at uplink.ua  Mon Dec 15 12:22:25 2014
From: v.melnik at uplink.ua (Vladimir Melnik)
Date: Mon, 15 Dec 2014 14:22:25 +0200
Subject: [Linux-cluster] The file on a GFS2-filesystem seems to be
 corrupted
In-Reply-To: <548EB0E6.3050706@redhat.com>
References: <20141213160449.GS23175@shagomer.uplink.net.ua>
	<20141215091732.GA15853@shagomer.uplink.net.ua>
	<548EA8A3.3000000@redhat.com>
	<20141215095442.GB15853@shagomer.uplink.net.ua>
	<548EB0E6.3050706@redhat.com>
Message-ID: <20141215122225.GC15853@shagomer.uplink.net.ua>

Thank you!

Is it safe if I run "fsck.gfs2 -n" without unmounting it?

On Mon, Dec 15, 2014 at 09:59:02AM +0000, Steven Whitehouse wrote:
> Hi,
> 
> On 15/12/14 09:54, Vladimir Melnik wrote:
> >Hi!
> >
> >The qcow2 isn't inderneath it, we can assume it's an ordinary file on a
> >filesystem. Its' size was about 300-400 GB, but now size is
> >7493992262336241664 bytes and I don't understand how it's happened. I'd
> >like to remove it, but I worry about consequences. :(
> Ok, I think I see now... either way though, if you are unsure about
> whether there is a problem, then unmounting on all nodes and running
> fsck is the way to go. That should pick up any problems that there
> might be with the filesystem. If you have the ability to snapshot
> the storage, then you could run fsck on a snapshot in order to avoid
> so much downtime.
> 
> An odd file size should not, in and of itself cause any problems
> with removing a file, so it will only be an issue if other on-disk
> metadata is incorrect,
> 
> Steve.
> 
> >On Mon, Dec 15, 2014 at 09:23:47AM +0000, Steven Whitehouse wrote:
> >>Hi,
> >>
> >>How did you generate the image in the first place? I don't know if
> >>we've ever really tested GFS2 with a qcow device underneath it -
> >>normally even in virt clusters the storage for GFS2 would be a real
> >>shared block device. Was this perhaps just a single node?
> >>
> >>Have you checked the image with fsck.gfs2 ?
> >>
> >>Steve.
> >>
> >>On 15/12/14 09:17, Vladimir Melnik wrote:
> >>>And one more question,
> >>>
> >>>Is it safe to remove this file? What will happen if I try to run 'rm
> >>>/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak', won't it corrupt
> >>>other files?
> >>>
> >>>Thanks.
> >>>
> >>>On Sat, Dec 13, 2014 at 06:04:48PM +0200, Vladimir Melnik wrote:
> >>>>Dear colleagues,
> >>>>
> >>>>I encountered some very strange issue and would be grateful if you share
> >>>>your thoughts on that.
> >>>>
> >>>>I have a qcow2-image that is located at gfs2 filesystem on a cluster.
> >>>>The cluster works fine and there are dozens of other qcow2-images, but,
> >>>>as I can see, one of images seems to be corrupted.
> >>>>
> >>>>First of all, it has quite unusual size:
> >>>>>stat /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak
> >>>>   File: `/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak'
> >>>>   Size: 7493992262336241664     Blocks: 821710640  IO Block: 4096   regular file
> >>>>Device: fd06h/64774d    Inode: 220986752   Links: 1
> >>>>Access: (0744/-rwxr--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> >>>>Access: 2014-10-09 16:25:24.864877839 +0300
> >>>>Modify: 2014-12-13 14:41:29.335603509 +0200
> >>>>Change: 2014-12-13 15:52:35.986888549 +0200
> >>>>
> >>>>By the way, I noticed that blocks' number looks rather okay.
> >>>>
> >>>>Also qemu-img can't recognize it as an image:
> >>>>>qemu-img info /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak
> >>>>image: /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak
> >>>>file format: raw
> >>>>virtual size: 6815746T (7493992262336241664 bytes)
> >>>>disk size: 392G
> >>>>
> >>>>Disk size, although, looks more reasonable: the image's size is really
> >>>>should be about 300-400G, as I remember.
> >>>>
> >>>>Alas, I can't do anything with this image. I can't check it by qemu-img,
> >>>>neither I can convert it to the new image, as qemu-img can't do anything
> >>>>with it:
> >>>>
> >>>>>qemu-img convert -p -f qcow2 -O qcow2 /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak /mnt/tmp/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0
> >>>>Could not open '/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak': Invalid argument
> >>>>Could not open '/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak'
> >>>>
> >>>>Any one have experienced the same issue? What do you think, is it qcow2
> >>>>issue or a gfs2 issue? What would you do in similar situation?
> >>>>
> >>>>Any ideas, hints and comments would be greatly appreciated.
> >>>>
> >>>>Yes, I have snapshots, that's good, but wouldn't like to lose today's
> >>>>changes to the data on that image. And I'm worried about the filesystem
> >>>>at all: what if something goes wrong if I try to remove that file?
> >>>>
> >>>>Thanks to all!
> >>>>
> >>>>
> >>>>-- 
> >>>>V.Melnik
> >>>>
> >>>>P.S. I use CentOS-6 and I have these packages installed:
> >>>>	qemu-img-0.12.1.2-2.415.el6_5.4.x86_64
> >>>>	gfs2-utils-3.0.12.1-59.el6_5.1.x86_64
> >>>>	lvm2-cluster-2.02.100-8.el6.x86_64
> >>>>	cman-3.0.12.1-59.el6_5.1.x86_64
> >>>>	clusterlib-3.0.12.1-59.el6_5.1.x86_64
> >>>>	kernel-2.6.32-431.5.1.el6.x86_64
> >>>>
> >>>>-- 
> >>>>Linux-cluster mailing list
> >>>>Linux-cluster at redhat.com
> >>>>https://www.redhat.com/mailman/listinfo/linux-cluster
> >>-- 
> >>Linux-cluster mailing list
> >>Linux-cluster at redhat.com
> >>https://www.redhat.com/mailman/listinfo/linux-cluster
> 
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-- 
V.Melnik


From swhiteho at redhat.com  Mon Dec 15 12:27:07 2014
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Mon, 15 Dec 2014 12:27:07 +0000
Subject: [Linux-cluster] The file on a GFS2-filesystem seems to be
	corrupted
In-Reply-To: <20141215122225.GC15853@shagomer.uplink.net.ua>
References: <20141213160449.GS23175@shagomer.uplink.net.ua>	<20141215091732.GA15853@shagomer.uplink.net.ua>	<548EA8A3.3000000@redhat.com>	<20141215095442.GB15853@shagomer.uplink.net.ua>	<548EB0E6.3050706@redhat.com>
	<20141215122225.GC15853@shagomer.uplink.net.ua>
Message-ID: <548ED39B.7000306@redhat.com>

Hi,

On 15/12/14 12:22, Vladimir Melnik wrote:
> Thank you!
>
> Is it safe if I run "fsck.gfs2 -n" without unmounting it?
It is better to unmount, as otherwise the view of the fs which fsck has 
will be different to that which the mount has, so there is no guarantee 
that fsck will give correct results without unmounting. So the short 
answer to your question is no,

Steve.

> On Mon, Dec 15, 2014 at 09:59:02AM +0000, Steven Whitehouse wrote:
>> Hi,
>>
>> On 15/12/14 09:54, Vladimir Melnik wrote:
>>> Hi!
>>>
>>> The qcow2 isn't inderneath it, we can assume it's an ordinary file on a
>>> filesystem. Its' size was about 300-400 GB, but now size is
>>> 7493992262336241664 bytes and I don't understand how it's happened. I'd
>>> like to remove it, but I worry about consequences. :(
>> Ok, I think I see now... either way though, if you are unsure about
>> whether there is a problem, then unmounting on all nodes and running
>> fsck is the way to go. That should pick up any problems that there
>> might be with the filesystem. If you have the ability to snapshot
>> the storage, then you could run fsck on a snapshot in order to avoid
>> so much downtime.
>>
>> An odd file size should not, in and of itself cause any problems
>> with removing a file, so it will only be an issue if other on-disk
>> metadata is incorrect,
>>
>> Steve.
>>
>>> On Mon, Dec 15, 2014 at 09:23:47AM +0000, Steven Whitehouse wrote:
>>>> Hi,
>>>>
>>>> How did you generate the image in the first place? I don't know if
>>>> we've ever really tested GFS2 with a qcow device underneath it -
>>>> normally even in virt clusters the storage for GFS2 would be a real
>>>> shared block device. Was this perhaps just a single node?
>>>>
>>>> Have you checked the image with fsck.gfs2 ?
>>>>
>>>> Steve.
>>>>
>>>> On 15/12/14 09:17, Vladimir Melnik wrote:
>>>>> And one more question,
>>>>>
>>>>> Is it safe to remove this file? What will happen if I try to run 'rm
>>>>> /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak', won't it corrupt
>>>>> other files?
>>>>>
>>>>> Thanks.
>>>>>
>>>>> On Sat, Dec 13, 2014 at 06:04:48PM +0200, Vladimir Melnik wrote:
>>>>>> Dear colleagues,
>>>>>>
>>>>>> I encountered some very strange issue and would be grateful if you share
>>>>>> your thoughts on that.
>>>>>>
>>>>>> I have a qcow2-image that is located at gfs2 filesystem on a cluster.
>>>>>> The cluster works fine and there are dozens of other qcow2-images, but,
>>>>>> as I can see, one of images seems to be corrupted.
>>>>>>
>>>>>> First of all, it has quite unusual size:
>>>>>>> stat /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak
>>>>>>    File: `/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak'
>>>>>>    Size: 7493992262336241664     Blocks: 821710640  IO Block: 4096   regular file
>>>>>> Device: fd06h/64774d    Inode: 220986752   Links: 1
>>>>>> Access: (0744/-rwxr--r--)  Uid: (    0/    root)   Gid: (    0/    root)
>>>>>> Access: 2014-10-09 16:25:24.864877839 +0300
>>>>>> Modify: 2014-12-13 14:41:29.335603509 +0200
>>>>>> Change: 2014-12-13 15:52:35.986888549 +0200
>>>>>>
>>>>>> By the way, I noticed that blocks' number looks rather okay.
>>>>>>
>>>>>> Also qemu-img can't recognize it as an image:
>>>>>>> qemu-img info /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak
>>>>>> image: /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak
>>>>>> file format: raw
>>>>>> virtual size: 6815746T (7493992262336241664 bytes)
>>>>>> disk size: 392G
>>>>>>
>>>>>> Disk size, although, looks more reasonable: the image's size is really
>>>>>> should be about 300-400G, as I remember.
>>>>>>
>>>>>> Alas, I can't do anything with this image. I can't check it by qemu-img,
>>>>>> neither I can convert it to the new image, as qemu-img can't do anything
>>>>>> with it:
>>>>>>
>>>>>>> qemu-img convert -p -f qcow2 -O qcow2 /mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak /mnt/tmp/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0
>>>>>> Could not open '/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak': Invalid argument
>>>>>> Could not open '/mnt/sp1/ac2cb28f-09ac-4ca0-bde1-471e0c7276a0.bak'
>>>>>>
>>>>>> Any one have experienced the same issue? What do you think, is it qcow2
>>>>>> issue or a gfs2 issue? What would you do in similar situation?
>>>>>>
>>>>>> Any ideas, hints and comments would be greatly appreciated.
>>>>>>
>>>>>> Yes, I have snapshots, that's good, but wouldn't like to lose today's
>>>>>> changes to the data on that image. And I'm worried about the filesystem
>>>>>> at all: what if something goes wrong if I try to remove that file?
>>>>>>
>>>>>> Thanks to all!
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> V.Melnik
>>>>>>
>>>>>> P.S. I use CentOS-6 and I have these packages installed:
>>>>>> 	qemu-img-0.12.1.2-2.415.el6_5.4.x86_64
>>>>>> 	gfs2-utils-3.0.12.1-59.el6_5.1.x86_64
>>>>>> 	lvm2-cluster-2.02.100-8.el6.x86_64
>>>>>> 	cman-3.0.12.1-59.el6_5.1.x86_64
>>>>>> 	clusterlib-3.0.12.1-59.el6_5.1.x86_64
>>>>>> 	kernel-2.6.32-431.5.1.el6.x86_64
>>>>>>
>>>>>> -- 
>>>>>> Linux-cluster mailing list
>>>>>> Linux-cluster at redhat.com
>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>> -- 
>>>> Linux-cluster mailing list
>>>> Linux-cluster at redhat.com
>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster


From keisuke.mori+ha at gmail.com  Mon Dec 22 08:35:10 2014
From: keisuke.mori+ha at gmail.com (Keisuke MORI)
Date: Mon, 22 Dec 2014 17:35:10 +0900
Subject: [Linux-cluster] [ha-wg-technical] [Pacemaker] [RFC] Organizing
	HA Summit 2015
In-Reply-To: <20141208133608.GC18879@redhat.com>
References: <540D853F.3090109@redhat.com> <20141208133608.GC18879@redhat.com>
Message-ID: <CAJM6Fh_GufBob3=sgTkFz7gYj8nNK648VJGgvxe=O+7y5m8e-g@mail.gmail.com>

Hi all,

Really late response but,
I will be joining the HA summit, with a few colleagues from NTT.

See you guys in Brno,
Thanks,


2014-12-08 22:36 GMT+09:00 Jan Pokorn? <jpokorny at redhat.com>:
> Hello,
>
> it occured to me that if you want to use the opportunity and double
> as as tourist while being in Brno, it's about the right time to
> consider reservations/ticket purchases this early.
> At least in some cases it is a must, e.g., Villa Tugendhat:
>
> http://rezervace.spilberk.cz/langchange.aspx?mrsname=&languageId=2&returnUrl=%2Flist
>
> On 08/09/14 12:30 +0200, Fabio M. Di Nitto wrote:
>> DevConf will start Friday the 6th of Feb 2015 in Red Hat Brno offices.
>>
>> My suggestion would be to have a 2 days dedicated HA summit the 4th and
>> the 5th of February.
>
> --
> Jan
>
> _______________________________________________
> ha-wg-technical mailing list
> ha-wg-technical at lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/ha-wg-technical
>


-- 
Keisuke MORI


From lists at alteeve.ca  Mon Dec 22 17:13:06 2014
From: lists at alteeve.ca (Digimer)
Date: Mon, 22 Dec 2014 12:13:06 -0500
Subject: [Linux-cluster] [ha-wg-technical] [Pacemaker] [RFC] Organizing
	HA Summit 2015
In-Reply-To: <CAJM6Fh_GufBob3=sgTkFz7gYj8nNK648VJGgvxe=O+7y5m8e-g@mail.gmail.com>
References: <540D853F.3090109@redhat.com> <20141208133608.GC18879@redhat.com>
	<CAJM6Fh_GufBob3=sgTkFz7gYj8nNK648VJGgvxe=O+7y5m8e-g@mail.gmail.com>
Message-ID: <54985122.5020907@alteeve.ca>

It will be very nice to see you again! Will Ikeda-san be there as well?

digimer

On 22/12/14 03:35 AM, Keisuke MORI wrote:
> Hi all,
>
> Really late response but,
> I will be joining the HA summit, with a few colleagues from NTT.
>
> See you guys in Brno,
> Thanks,
>
>
> 2014-12-08 22:36 GMT+09:00 Jan Pokorn? <jpokorny at redhat.com>:
>> Hello,
>>
>> it occured to me that if you want to use the opportunity and double
>> as as tourist while being in Brno, it's about the right time to
>> consider reservations/ticket purchases this early.
>> At least in some cases it is a must, e.g., Villa Tugendhat:
>>
>> http://rezervace.spilberk.cz/langchange.aspx?mrsname=&languageId=2&returnUrl=%2Flist
>>
>> On 08/09/14 12:30 +0200, Fabio M. Di Nitto wrote:
>>> DevConf will start Friday the 6th of Feb 2015 in Red Hat Brno offices.
>>>
>>> My suggestion would be to have a 2 days dedicated HA summit the 4th and
>>> the 5th of February.
>>
>> --
>> Jan
>>
>> _______________________________________________
>> ha-wg-technical mailing list
>> ha-wg-technical at lists.linux-foundation.org
>> https://lists.linuxfoundation.org/mailman/listinfo/ha-wg-technical
>>
>
>
>


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?